BindToDevice() binds graph to specified (gpu) device which forces all its operations to be processed on that device. #20412

Merged

Conversation

bioothod
Contributor

Export to golang as well.

If you want to enforce strict processing of a graph on a specified device (for example, one GPU among several), you must bind the graph or individual operations to that device. Only the C++ API somewhat supports that; this patch makes it easier to use and adds helper functions for C and golang.

N.B. No matter which settings you use, currently (master as of Jun 27 and the 1.6.0 release) all golang inference happens on the CPU. Before 1.6 one could play with GPUConfig.VisibleDeviceList, but it currently crashes (there are some reasons for that) and in any case always binds to the CPU. So a side effect of this patch is that one can not only tune GPU execution but also turn it on again.

all its operations to be processed on that device.

Export to golang as well.
@bioothod bioothod changed the title BindToDevice() bind graph to specified (gpu) device which forces all its operations to be processed on that device. BindToDevice() binds graph to specified (gpu) device which forces all its operations to be processed on that device. Jun 29, 2018
@bioothod
Contributor Author

crash issue: #19083
resolution for c++: #18861

@yifeif yifeif requested a review from asimshankar July 11, 2018 17:22
@yifeif yifeif added the awaiting review Pull request awaiting review label Jul 11, 2018
@asimshankar
Contributor

asimshankar commented Jul 11, 2018

Thanks for the PR, @bioothod, but I'd like to understand more about your use case first.

Firstly, could you describe the issue you were running into with using the GPU? By default, if linked with the GPU-enabled version of the TensorFlow C library, the Go program should automatically select a GPU for kernels where this is appropriate. If that is not happening, it seems like something we should fix.

Secondly, the graph construction API as it stands intentionally does not allow nodes to be mutated after they are added to the graph, as this may lead to confusing behavior (e.g., if two Sessions are created on the same graph and the graph is modified between the two). The TF_BindToDevice function added here is a mutation interface, so I'm a little wary of it. You'll notice that the suggested workaround operated on the GraphDef protocol buffer (adding the device name to nodes there) instead of operating on the in-memory representation of the graph.

Looking forward to your response.

@bioothod
Contributor Author

bioothod commented Jul 11, 2018

Firstly, could you describe the issue you were running into with using the GPU? By default, if linked with the GPU-enabled version of the TensorFlow C library, the Go program should automatically select a GPU for kernels where this is appropriate. If that is not happening, it seems like something we should fix.

It should, but it does not. I have a pretty simple golang application which uses default session options, and while I can list all GPU devices on the board, the session is never bound to any of them. This changed probably around 1.6, but I cannot say for sure. Literally the same code binds to random GPUs on 1.4.1 but does not on 1.6.0 or recent master. Previously I used nvidia-smi, for example, to monitor GPU load and saw (and now do not see) load on one card or another. If I use a custom session option with LogDevicePlacement=true, I see that the session is always placed on the CPU on 1.6.0.

But that is only part of the problem; I also want to run a particular session on a particular GPU device among several available. Previously I could tune VisibleDeviceList in the GPU part of the session options, and things worked great. But after 1.6.0 you decided that mapping multiple virtual devices onto the same physical device might confuse some operations (I don't see how that is ever possible, but still), so golang code now panics if you ever set VisibleDeviceList to something other than CUDA_VISIBLE_DEVICES (or an empty string).

Hence this patch - now I can create multiple graphs, each bound to its own GPU device, and select among them according to my policies when creating a new session. The GraphDef solution is quite heavy for many sessions, and anyway the C++ code can already access graph nodes; TF_BindToDevice is just a nice wrapper exposing that same functionality. As long as there is a way to bind a graph to a device (that's what the new golang function ImportWithDevice() does), it's OK with me to hide the details, but there is no easy way to iterate over graph nodes in the Go bindings, so there is a wrapper function.
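
Roughly, the usage I have in mind from the Go side is something like the following (a minimal sketch only; the ImportWithDevice signature shown here is illustrative and may not match the final patch):

package main

import (
    "io/ioutil"
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    // "graph.pb" is a placeholder path for a serialized GraphDef.
    model, err := ioutil.ReadFile("graph.pb")
    if err != nil {
        log.Fatal(err)
    }

    // One graph per GPU, one session per graph; a scheduler then picks a
    // session according to its own policy.
    devices := []string{"/device:GPU:0", "/device:GPU:1"}
    var sessions []*tf.Session
    for _, dev := range devices {
        graph := tf.NewGraph()
        // Hypothetical helper from this PR: import the GraphDef and pin
        // every imported node to dev.
        if err := graph.ImportWithDevice(model, "", dev); err != nil {
            log.Fatalf("import on %s failed: %v", dev, err)
        }
        sess, err := tf.NewSession(graph, nil)
        if err != nil {
            log.Fatalf("session on %s failed: %v", dev, err)
        }
        sessions = append(sessions, sess)
    }
    _ = sessions
}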

@asimshankar
Contributor

asimshankar commented Jul 11, 2018

Thanks for the note. Let's separate the two issues.

The GPU should work fine in 1.6+. I just tried with 1.9 and it does seem to work.
See https://gist.github.com/asimshankar/ef367e4897e248466c42c2dc629814e0

Particularly the lines in the output:

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0

If the same is not happening in the setup you're running, it's worth investigating. So this PR shouldn't be required to enable use of the GPU.

That said, I appreciate that running the same model on multiple devices isn't as smooth as it should be and that the experience can be improved. However, I don't think this approach is the best one. As I mentioned earlier, I'm wary of mutations to the graph, since they can lead to confusing behavior with multiple sessions. Furthermore, the implementation here forces every node to run on the GPU - which will be problematic if the graph has nodes that only have CPU kernels (unless the user sets allow_soft_placement=True). Thus, I don't think this works out as an appropriate general-purpose solution (though it may work just fine for specific models).

In both cases (whether via the approach in this PR, or via the process that writes out the graph to be imported), we're creating multiple sessions, each with its own copy of the graph.

One somewhat ugly workaround is to have the program that creates the graph create a single saved model, with one tag per GPU. Then the Go program can create one session per GPU by providing the right tag to LoadSavedModel. Though, yeah, I admit this is a bit cumbersome. Another possibly cleaner option would be to have ImportGraphDef take a device specification and use that (similar to how tf.import_graph_def in Python respects the device stack). Though, that will be a bit more work.
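
A minimal sketch of that first workaround from the Go side (the export directory and tag names are made up for illustration; the exporter would have to write one MetaGraph per device under those tags):

package main

import (
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    // Assumed layout: one SavedModel directory containing one MetaGraph per
    // device, tagged "serve_gpu0", "serve_gpu1", ...
    tags := []string{"serve_gpu0", "serve_gpu1"}
    var sessions []*tf.Session
    for _, tag := range tags {
        m, err := tf.LoadSavedModel("/models/mymodel", []string{tag}, nil)
        if err != nil {
            log.Fatalf("loading tag %q failed: %v", tag, err)
        }
        sessions = append(sessions, m.Session)
    }
    // sessions[i] now runs the graph variant pinned to the i-th GPU.
    _ = sessions
}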

Thanks for your understanding.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jul 12, 2018
@bioothod
Contributor Author

bioothod commented Jul 12, 2018

That's not quite what I'm working with; the problem with GPU placement happens when you load a new graph from a protobuf and run it in Go.

I've made a simple example repo to highlight the problem: https://github.com/bioothod/golang_gpu_example

If you clone it into $GOPATH/src and run the following commands, you will see the problem:

$ cd golang_gpu_example
$ python3 ./make_graph.py --output test.pb 
$ go build
$ ./gpu ./test.pb 
2018-07-12 15:40:27.892564: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2018-07-12 15:40:28.025248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 9.21GiB
2018-07-12 15:40:28.025276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-12 15:40:28.212742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-12 15:40:28.212771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-07-12 15:40:28.212778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-07-12 15:40:28.212941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8900 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1
2018-07-12 15:40:28.294047: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1

output/op: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
2018-07-12 15:40:28.294321: I tensorflow/core/common_runtime/placer.cc:886] output/op: (MatMul)/job:localhost/replica:0/task:0/device:CPU:0
input/ph0: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2018-07-12 15:40:28.294335: I tensorflow/core/common_runtime/placer.cc:886] input/ph0: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
input/ph1: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2018-07-12 15:40:28.294341: I tensorflow/core/common_runtime/placer.cc:886] input/ph1: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
op: [[3 4 5] [6 8 10] [9 12 15]]

This does not happen all the time, though: with this particular graph it is always on the CPU, but op = tf.reduce_sum(ph0 * ph1, axis=1, name='output/op') runs on the GPU. With my rather large graphs it is always on the CPU.

Forcing a graph with CPU-only kernels to run on the GPU should not be a problem - it is not a real force, only a hint, and it would be great if TF emitted some kind of warning in this case. It is still MUCH better than running on the CPU when all the kernels do have GPU implementations.
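
For reference, the Go side of the example repo is roughly the following (operation names and feed shapes are inferred from the log above; the actual main.go may differ slightly):

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    model, err := ioutil.ReadFile(os.Args[1])
    if err != nil {
        log.Fatalf("failed to read graph: %v", err)
    }

    graph := tf.NewGraph()
    if err := graph.Import(model, ""); err != nil {
        log.Fatalf("failed to import graph: %v", err)
    }

    // Default session options: device placement is left entirely to TensorFlow.
    sess, err := tf.NewSession(graph, nil)
    if err != nil {
        log.Fatalf("failed to create session: %v", err)
    }
    defer sess.Close()

    t0, _ := tf.NewTensor([][]int32{{1}, {2}, {3}}) // 3x1
    t1, _ := tf.NewTensor([][]int32{{3, 4, 5}})     // 1x3
    out, err := sess.Run(
        map[tf.Output]*tf.Tensor{
            graph.Operation("input/ph0").Output(0): t0,
            graph.Operation("input/ph1").Output(0): t1,
        },
        []tf.Output{graph.Operation("output/op").Output(0)},
        nil,
    )
    if err != nil {
        log.Fatalf("failed to run: %v", err)
    }
    fmt.Println("op:", out[0].Value()) // [[3 4 5] [6 8 10] [9 12 15]]
}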

@bioothod
Contributor Author

bioothod commented Aug 1, 2018

Are there any questions you might have concerning this issue? Does my code highlight the problem in your environment?

@asimshankar
Contributor

@bioothod - sorry, I missed your last update. Will take a look at your example soon.

@bioothod
Contributor Author

@asimshankar, still no progress on this?

@asimshankar
Contributor

asimshankar commented Aug 10, 2018

@bioothod - sorry for delay. I'm traveling right now, but will definitely respond by Tuesday.

That said, one quick observation in your example - it seems the graph is operating on int32 tensors, for which there are few GPU kernels. Out of curiosity, does your application require int32, or was that just for this demo program? I suspect behavior will be quite different for other types like float or int64.

@asimshankar
Contributor

@bioothod : Took a look at your example and had some comments/observations.

As mentioned above, the story will be much different if you're using types other than int32 (for example, with float32). Support for 32-bit integer operations on GPU is unfortunately a bit iffy in TensorFlow at the moment and the log message you're observing might be misleading as in both cases the operation is actually executing on CPU. The MatMul kernel does not have a GPU implementation for int32 and thus the device annotation is ignored (since you set AllowSoftPlacement: true). The Sum kernel (from tf.reduce_sum) requires the input and output for the int32 type to be in host memory (not device memory) and also just executes on CPU (the log messaging is misleading there because of the hack mentioned in the comment of the kernel registration).

Long story short, the BindToDevice call in your example has no real effect (the unfortunately misleading log message aside).

Furthermore, going back to my original reservation about adding this mutation to the graph - we've consciously avoided C APIs that mutate existing nodes in the graph, as it can be hard to determine whether or not the mutations apply correctly. For example, consider the following:

TF_Graph* graph = MakeMyGraph();
TF_BindToDevice(graph, "/cpu:0");
TF_Session* session = TF_NewSession(graph, ...);
TF_SessionRun(session, ...);
TF_BindToDevice(graph, "/gpu:0");
TF_SessionRun(session, ...);

In this snippet, the second call to TF_BindToDevice has no effect, and all operations will execute on the CPU. I'm hesitant to add a function with such an esoteric and hard-to-explain contract.

Alternatives would be the following:

  1. Do this binding as an option to TF_ImportGraphDef, so it applies only to nodes being imported.
  2. Rewrite the GraphDef proto before importing (see similar snippet for Java)
  3. As suggested in the StackOverflow question above, somehow add a notion of virtual devices so that each Session can map the device in the graph to a different device at runtime.

Sound reasonable?

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Aug 14, 2018
@bioothod
Contributor Author

bioothod commented Aug 15, 2018

Thank you for the response, @asimshankar!

I have to disagree - using TF_BindToDevice allows binding the whole graph to a specified device, and it will be executed there. I cannot argue about int32 operations, but I do see that my large graphs are currently not executed on the GPU at all without this patch. So, basically, the TF C/C++/Go bindings are currently broken: all graphs are always executed on the CPU.

With this patch, nodes (all of them, or only float32 for example - I cannot say for sure) are executed on the GPU. This is confirmed not only by the device placement log, which you say is misleading (and that's a bug too, imho), but also by HW monitoring tools (like nvidia-smi).

It could be interesting to check float32 operations (I'm currently away from the servers and cannot run a similar float32 test), but what's the point? We have a graph which, according to the documentation, is supposed to be executed on the GPU, and it is not. Hence the patch.

As you said, a second call to TF_BindToDevice will not rebind the graph if a session has already been created, but that should not be a problem: TF_BindToDevice is not a strict order but a suggestion, a hint for execution, which can be clearly stated in the documentation together with the example you have made.

The session config already had this virtual-to-physical device mapping, and the TF developers decided that it must not be used at all - at least the Go bindings crash if the visible device list does not match the per-process device list (CUDA_VISIBLE_DEVICES, for instance) or is not empty. The logic that multiple virtual devices may point to the same physical one remains, although I personally have never seen operations that require changing physical device properties.

TF_BindToDevice is actually just a helper; if you are strongly against adding it (with appropriate documentation), I think updating TF_GraphImportGraphDef will work too, although it is a bit clumsier.

So, let the solution be to extend TF_GraphImportGraphDef?
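
Sketched from the Go side, import-time binding would look something like this (the option struct and field names are assumptions about what the extended API could look like, not a final interface):

package main

import (
    "io/ioutil"
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    // "graph.pb" is a placeholder path for a serialized GraphDef.
    model, err := ioutil.ReadFile("graph.pb")
    if err != nil {
        log.Fatal(err)
    }

    graph := tf.NewGraph()
    // The device is fixed at import time, so a session created from this
    // graph never observes a later re-binding.
    opts := tf.GraphImportOptions{
        Prefix: "",
        Device: "/device:GPU:1",
    }
    if err := graph.ImportWithOptions(model, opts); err != nil {
        log.Fatalf("import failed: %v", err)
    }

    sess, err := tf.NewSession(graph, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer sess.Close()
}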

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 16, 2018
@asimshankar
Contributor

@bioothod: You say that "So, basically, the TF C/C++/Go bindings are currently broken: all graphs are always executed on the CPU," but this is not true :). One can certainly execute graphs on GPUs from Go, as demonstrated in the gist I previously linked to, and also in your example (https://github.com/bioothod/golang_gpu_example) when I change the placeholders to tf.float32 instead of tf.int32.
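
For what it's worth, the float32 behavior can also be reproduced without regenerating the .pb, by building the same tiny graph directly with the Go op package (an illustrative sketch; shapes follow your example, and placement can be checked with nvidia-smi or placement logging):

package main

import (
    "fmt"
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
    "github.com/tensorflow/tensorflow/tensorflow/go/op"
)

func main() {
    // Same structure as make_graph.py, but with float32 placeholders, for
    // which MatMul does have a GPU kernel.
    s := op.NewScope()
    ph0 := op.Placeholder(s.SubScope("input"), tf.Float, op.PlaceholderShape(tf.MakeShape(3, 1)))
    ph1 := op.Placeholder(s.SubScope("input"), tf.Float, op.PlaceholderShape(tf.MakeShape(1, 3)))
    prod := op.MatMul(s.SubScope("output"), ph0, ph1)

    graph, err := s.Finalize()
    if err != nil {
        log.Fatal(err)
    }
    sess, err := tf.NewSession(graph, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer sess.Close()

    t0, _ := tf.NewTensor([][]float32{{1}, {2}, {3}})
    t1, _ := tf.NewTensor([][]float32{{3, 4, 5}})
    out, err := sess.Run(
        map[tf.Output]*tf.Tensor{ph0: t0, ph1: t1},
        []tf.Output{prod},
        nil,
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out[0].Value())
}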

Regarding TF_BindToDevice - yes, we could just add it with documentation trying to explain that it may or may not work. But that seems like a bad API, as there is no way for potential callers to know whether or not the call will have an effect. For example, if I write some utility library which invokes TF_BindToDevice, whether or not it has an effect depends on the environment from which the caller invoked my library function. This makes it very hard to reason about.

I'm not opposed to extending TF_GraphImportGraphDef, though I haven't fully thought it through yet. But the fact that you aren't able to use the GPU without this explicit binding (as you said, "We have a graph which, according to the documentation, is supposed to be executed on the GPU, and it is not.") seems like some other issue which we should get to the bottom of.

Can you provide an example of such a graph?

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Aug 23, 2018
@tensorflowbutler tensorflowbutler added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 6, 2018
@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this PR still valid? Assigning the stalled label. Please comment to reassure me that this is still being worked on.

@bioothod
Contributor Author

bioothod commented Sep 9, 2018

Sorry for the long delays. I will send an updated patch for review soon; some other things have distracted me from this task, but I haven't given up on it.

@tensorflowbutler tensorflowbutler added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting response Status - Awaiting response from author stale This label marks the issue/pr stale - to be closed automatically if no activity labels Sep 10, 2018
…s to bind newly created graph, drop TF_BindToDevice(), since (re)binding at runtime is being frowned upon upstream
asimshankar
asimshankar previously approved these changes Oct 23, 2018
Contributor

@asimshankar asimshankar left a comment

One more std::string, but otherwise looks good. thanks!

@asimshankar asimshankar added the ready to pull PR ready for merge process label Oct 23, 2018
@tensorflowbutler tensorflowbutler removed ready to pull PR ready for merge process stat:awaiting response Status - Awaiting response from author labels Oct 23, 2018
@bioothod
Contributor Author

Updated patch, but it looks like this reset the merge :)

Sorry. And thank you!

@asimshankar asimshankar added ready to pull PR ready for merge process kokoro:force-run Tests on submitted change labels Oct 23, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Oct 23, 2018
@tensorflow-copybara tensorflow-copybara merged commit 77ff33a into tensorflow:master Oct 24, 2018
tensorflow-copybara pushed a commit that referenced this pull request Oct 24, 2018
@asimshankar
Contributor

I intend to temporarily disable this to resolve #23257, and then re-enable it after a release of the C library that includes this PR has been made.

(That way, "go get" works by default and some code tweaks are needed to pin devices from Go, instead of the other way around, where "go get" fails by default and "git checkout" is needed to make it work.)

@bioothod
Contributor Author

Yeah, I read that issue. I don't really know how to solve this kind of problem, except maybe by manually splitting the C part from the other parts and only merging the golang/python/whatever changes after the C part has been released.

tensorflow-copybara pushed a commit that referenced this pull request Nov 28, 2018
By temporarily disabling the features introduced in
PR #20412

Will re-enable after the C libraries for TensorFlow 1.13 or later
have been released.

PiperOrigin-RevId: 223079148
@asimshankar
Contributor

Once the dust settles on go dep / go mod, perhaps we can utilize that for versioning.

This hasn't happened often enough to be troublesome, so I'm okay with some manual work for now.

@bioothod
Contributor Author

bioothod commented Jan 29, 2019

@asimshankar The default execution device is still disabled in the golang API; should it be uncommented now that 1.13.0-rc0 is out?

@asimshankar
Contributor

@bioothod: It may be best to wait till 1.13.0 final is out instead of relying on the RC?

@bioothod
Contributor Author

bioothod commented Feb 8, 2019

Fair enough, let's wait for the 1.13 release to roll out.

@bioothod
Contributor Author

bioothod commented Apr 3, 2019

@asimshankar Hi, is it time to merge the Go changes which were commented out in 6f09a09?

although the documentation for the C library build still references 1.12

@asimshankar
Contributor

Yeah, I think so. @jhseu @alextp would be happy to help that through if you could send a PR (I'm no longer actively working on TensorFlow). Thanks!

@alextp
Contributor

alextp commented Apr 3, 2019

Yes, happy to review and help merge a PR

@bioothod
Contributor Author

This is it: #27891
I've made it against r1.13, but it should be applicable to master too.
