BindToDevice() binds graph to specified (gpu) device which forces all its operations to be processed on that device. #20412

Merged

Conversation

bioothod
Contributor

Export to golang as well.

If you want to enforce strict processing of a graph on a specified device (for example, one GPU among several), you must bind the graph or individual operations to that device. Only the C++ API somewhat supports that; this patch makes it easier to use and adds helper functions for C and golang.

N.B. No matter which settings you use, currently (master as of Jun 27 and the 1.6.0 release) all golang inference happens on the CPU. Before 1.6 one could play with GPUConfig.VisibleDeviceList, but it currently crashes (there are some reasons for that) and in any case always binds to the CPU. So a side effect of this patch is that one can not only tune GPU execution but also turn it on again.

all its operations to be processed on that device.

Export to golang as well.
@bioothod bioothod changed the title BindToDevice() bind graph to specified (gpu) device which forces all its operations to be processed on that device. BindToDevice() binds graph to specified (gpu) device which forces all its operations to be processed on that device. Jun 29, 2018
@bioothod
Contributor Author

crash issue: #19083
resolution for c++: #18861

@yifeif yifeif requested a review from asimshankar July 11, 2018 17:22
@yifeif yifeif added the awaiting review Pull request awaiting review label Jul 11, 2018
@asimshankar
Contributor

asimshankar commented Jul 11, 2018

Thanks for the PR, @bioothod, but I'd like to understand more about your use case first.

Firstly, could you describe the issue you were running into with using the GPU? By default, if linked with the GPU-enabled version of the TensorFlow C library, the Go program should automatically select a GPU for kernels where this is appropriate. If that is not happening, it seems like something we should fix.

Secondly, the graph construction API as it stands intentionally does not allow nodes to be mutated after they are added to the graph, as this may lead to confusing behavior (e.g., if two Sessions are created on the same graph and the graph is modified between the two). The TF_BindToDevice function added here is a mutation interface, so I'm a little wary of it. You'll notice that the suggested workaround operated on the GraphDef protocol buffer (adding the device name to nodes there) instead of operating on the in-memory representation of the graph.

Looking forward to your response.

@bioothod
Contributor Author

bioothod commented Jul 11, 2018

Firstly, could you describe the issue you were running into with using the GPU? By default, if linked with the GPU-enabled version of the TensorFlow C library, the Go program should automatically select a GPU for kernels where this is appropriate. If that is not happening, it seems like something we should fix.

It should, but it does not. I have a pretty simple golang application which uses default session options, and while I can list all GPU devices on the board, the session is never bound to any of them. This changed probably around 1.6, but I cannot say for sure. Literally the same code binds to random GPUs on 1.4.1 but does not on 1.6.0 or recent master. Previously I used nvidia-smi, for example, to monitor GPU load and saw (and now do not see) load on one card or another. If I use a custom session option with LogDevicePlacement=true, I see that the session is always placed on the CPU on 1.6.0.

But that is only part of the problem; I also want to run a particular session on a particular GPU device among several available. Previously I could tune VisibleDeviceList in the GPU part of the session options, and things worked great. But after 1.6.0 you decided that mapping multiple virtual devices onto the same physical device might confuse some operations (I don't see how that is ever possible, but still), so golang code now panics if you ever set VisibleDeviceList to something other than CUDA_VISIBLE_DEVICES (or an empty string).

Hence this patch - now I can create multiple graphs, each bound to its own GPU device, and select among them according to my policies when creating a new session. The GraphDef solution is quite heavy for many sessions, and anyway the C++ code can already access graph nodes; TF_BindToDevice is just a nice wrapper exposing that same functionality. As long as there is a way to bind a graph to a device (that's what the new golang function ImportWithDevice() does), it's OK with me to hide the details, but there is no easy way to iterate over graph nodes in the Go bindings, so there is a wrapper function.
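
Roughly, the usage I have in mind from the Go side is something like the following (a minimal sketch only; the ImportWithDevice signature shown here is illustrative and may not match the final patch):

package main

import (
    "io/ioutil"
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    // "graph.pb" is a placeholder path for a serialized GraphDef.
    model, err := ioutil.ReadFile("graph.pb")
    if err != nil {
        log.Fatal(err)
    }

    // One graph per GPU, one session per graph; a scheduler then picks a
    // session according to its own policy.
    devices := []string{"/device:GPU:0", "/device:GPU:1"}
    var sessions []*tf.Session
    for _, dev := range devices {
        graph := tf.NewGraph()
        // Hypothetical helper from this PR: import the GraphDef and pin
        // every imported node to dev.
        if err := graph.ImportWithDevice(model, "", dev); err != nil {
            log.Fatalf("import on %s failed: %v", dev, err)
        }
        sess, err := tf.NewSession(graph, nil)
        if err != nil {
            log.Fatalf("session on %s failed: %v", dev, err)
        }
        sessions = append(sessions, sess)
    }
    _ = sessions
}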

@asimshankar
Contributor

asimshankar commented Jul 11, 2018

Thanks for the note. Let's separate the two issues.

The GPU should work fine in 1.6+. I just tried with 1.9 and it does seem to work.
See https://gist.github.com/asimshankar/ef367e4897e248466c42c2dc629814e0

Particularly the lines in the output:

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0

If the same is not happening in the setup you're running, it's worth investigating. So this PR shouldn't be required to enable use of the GPU.

That said, I appreciate that running the same model on multiple devices isn't as smooth as it should be and that the experience can be improved. However, I don't think this approach is the best one. As I mentioned earlier, I'm wary of mutations to the graph, since they can lead to confusing behavior with multiple sessions. Furthermore, the implementation here forces every node to run on the GPU - which will be problematic if the graph has nodes that only have CPU kernels (unless the user sets allow_soft_placement=True). Thus, I don't think this works out as an appropriate general-purpose solution (though it may work just fine for specific models).

In both cases (whether via the approach in this PR, or via the process that writes out the graph to be imported), we're creating multiple sessions, each with its own copy of the graph.

One somewhat ugly workaround is to have the program that creates the graph create a single saved model, with one tag per GPU. Then the Go program can create one session per GPU by providing the right tag to LoadSavedModel. Though, yeah, I admit this is a bit cumbersome. Another possibly cleaner option would be to have ImportGraphDef take a device specification and use that (similar to how tf.import_graph_def in Python respects the device stack). Though, that will be a bit more work.
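
A minimal sketch of that first workaround from the Go side (the export directory and tag names are made up for illustration; the exporter would have to write one MetaGraph per device under those tags):

package main

import (
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    // Assumed layout: one SavedModel directory containing one MetaGraph per
    // device, tagged "serve_gpu0", "serve_gpu1", ...
    tags := []string{"serve_gpu0", "serve_gpu1"}
    var sessions []*tf.Session
    for _, tag := range tags {
        m, err := tf.LoadSavedModel("/models/mymodel", []string{tag}, nil)
        if err != nil {
            log.Fatalf("loading tag %q failed: %v", tag, err)
        }
        sessions = append(sessions, m.Session)
    }
    // sessions[i] now runs the graph variant pinned to the i-th GPU.
    _ = sessions
}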

Thanks for your understanding.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jul 12, 2018
@bioothod
Contributor Author

bioothod commented Jul 12, 2018

That's not quite what I'm working with; the problem with GPU placement happens when you load a new graph from a protobuf and run it in Go.

I've made a simple example repo to highlight the problem: https://github.com/bioothod/golang_gpu_example

If you clone it into $GOPATH/src and run the following commands, you will see the problem:

$ cd golang_gpu_example
$ python3 ./make_graph.py --output test.pb 
$ go build
$ ./gpu ./test.pb 
2018-07-12 15:40:27.892564: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2018-07-12 15:40:28.025248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 9.21GiB
2018-07-12 15:40:28.025276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-12 15:40:28.212742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-12 15:40:28.212771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-07-12 15:40:28.212778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-07-12 15:40:28.212941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8900 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1
2018-07-12 15:40:28.294047: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1

output/op: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
2018-07-12 15:40:28.294321: I tensorflow/core/common_runtime/placer.cc:886] output/op: (MatMul)/job:localhost/replica:0/task:0/device:CPU:0
input/ph0: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2018-07-12 15:40:28.294335: I tensorflow/core/common_runtime/placer.cc:886] input/ph0: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
input/ph1: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
2018-07-12 15:40:28.294341: I tensorflow/core/common_runtime/placer.cc:886] input/ph1: (Placeholder)/job:localhost/replica:0/task:0/device:CPU:0
op: [[3 4 5] [6 8 10] [9 12 15]]

This does not happen all the time, though: with this particular graph it is always on the CPU, but op = tf.reduce_sum(ph0 * ph1, axis=1, name='output/op') runs on the GPU. With my rather large graphs it is always on the CPU.

Forcing a graph with CPU-only kernels to run on the GPU should not be a problem - it is not a real force, only a hint, and it would be great if TF emitted some kind of warning in this case. It is still MUCH better than running on the CPU when all the kernels do have GPU implementations.
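
For reference, the Go side of the example repo is roughly the following (operation names and feed shapes are inferred from the log above; the actual main.go may differ slightly):

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    model, err := ioutil.ReadFile(os.Args[1])
    if err != nil {
        log.Fatalf("failed to read graph: %v", err)
    }

    graph := tf.NewGraph()
    if err := graph.Import(model, ""); err != nil {
        log.Fatalf("failed to import graph: %v", err)
    }

    // Default session options: device placement is left entirely to TensorFlow.
    sess, err := tf.NewSession(graph, nil)
    if err != nil {
        log.Fatalf("failed to create session: %v", err)
    }
    defer sess.Close()

    t0, _ := tf.NewTensor([][]int32{{1}, {2}, {3}}) // 3x1
    t1, _ := tf.NewTensor([][]int32{{3, 4, 5}})     // 1x3
    out, err := sess.Run(
        map[tf.Output]*tf.Tensor{
            graph.Operation("input/ph0").Output(0): t0,
            graph.Operation("input/ph1").Output(0): t1,
        },
        []tf.Output{graph.Operation("output/op").Output(0)},
        nil,
    )
    if err != nil {
        log.Fatalf("failed to run: %v", err)
    }
    fmt.Println("op:", out[0].Value()) // [[3 4 5] [6 8 10] [9 12 15]]
}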

@bioothod
Contributor Author

bioothod commented Aug 1, 2018

Are there any questions you might have concerning this issue? Does my code highlight the problem in your environment?

@asimshankar
Contributor

@bioothod - sorry, I missed your last update. Will take a look at your example soon.

@bioothod
Contributor Author

@asimshankar, still no progress on this?

@asimshankar
Contributor

asimshankar commented Aug 10, 2018

@bioothod - sorry for delay. I'm traveling right now, but will definitely respond by Tuesday.

That said, one quick observation in your example - it seems the graph is operating on int32 tensors, for which there are few GPU kernels. Out of curiosity, does your application require int32, or was that just for this demo program? I suspect behavior will be quite different for other types like float or int64.

@asimshankar
Contributor

@bioothod : Took a look at your example and had some comments/observations.

As mentioned above, the story will be much different if you're using types other than int32 (for example, with float32). Support for 32-bit integer operations on GPU is unfortunately a bit iffy in TensorFlow at the moment and the log message you're observing might be misleading as in both cases the operation is actually executing on CPU. The MatMul kernel does not have a GPU implementation for int32 and thus the device annotation is ignored (since you set AllowSoftPlacement: true). The Sum kernel (from tf.reduce_sum) requires the input and output for the int32 type to be in host memory (not device memory) and also just executes on CPU (the log messaging is misleading there because of the hack mentioned in the comment of the kernel registration).

Long story short, the BindToDevice call in your example has no real effect (the unfortunately misleading log message aside).

Furthermore, going back to my original reservation about adding this mutation to the graph - we've consciously avoided C APIs that mutate existing nodes in the graph, as it can be hard to determine whether or not the mutations apply correctly. For example, consider the following:

TF_Graph* graph = MakeMyGraph();
TF_BindToDevice(graph, "/cpu:0");
TF_Session* session = TF_NewSession(graph, ...);
TF_SessionRun(session, ...);
TF_BindToDevice(graph, "/gpu:0");
TF_SessionRun(session, ...);

In this snippet, the second call to TF_BindToDevice has no effect, and all operations will execute on the CPU. I'm hesitant to add a function with such an esoteric and hard-to-explain contract.

Alternatives would be the following:

  1. Do this binding as an option to TF_ImportGraphDef, so it applies only to nodes being imported.
  2. Rewrite the GraphDef proto before importing (see similar snippet for Java)
  3. As suggested in the StackOverflow question above, somehow add a notion of virtual devices so that each Session can map the device in the graph to a different device at runtime.

Sound reasonable?

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Aug 14, 2018
@bioothod
Contributor Author

bioothod commented Aug 15, 2018

Thank you for the response, @asimshankar!

I have to disagree - using TF_BindToDevice allows binding the whole graph to a specified device, and it will be executed there. I cannot argue about int32 operations, but I do see that my large graphs are currently not executed on the GPU at all without this patch. So, basically, the TF C/C++/Go bindings are currently broken: all graphs are always executed on the CPU.

With this patch, nodes (all of them, or only float32 for example - I cannot say for sure) are executed on the GPU. This is confirmed not only by the device placement log, which you say is misleading (and that's a bug too, imho), but also by HW monitoring tools (like nvidia-smi).

It could be interesting to check float32 operations (I'm currently away from the servers and cannot run a similar float32 test), but what's the point? We have a graph which, according to the documentation, is supposed to be executed on the GPU, and it is not. Hence the patch.

As you said, a second call to TF_BindToDevice will not rebind the graph if a session has already been created, but that should not be a problem: TF_BindToDevice is not a strict order but a suggestion, a hint for execution, which can be clearly stated in the documentation together with the example you have made.

The session config already had this virtual-to-physical device mapping, and the TF developers decided that it must not be used at all - at least the Go bindings crash if the visible device list does not match the per-process device list (CUDA_VISIBLE_DEVICES, for instance) or is not empty. The logic that multiple virtual devices may point to the same physical one remains, although I personally have never seen operations that require changing physical device properties.

TF_BindToDevice is actually just a helper; if you are strongly against adding it (with appropriate documentation), I think updating TF_GraphImportGraphDef will work too, although it is a bit clumsier.

So, let the solution be to extend TF_GraphImportGraphDef?
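
Sketched from the Go side, import-time binding would look something like this (the option struct and field names are assumptions about what the extended API could look like, not a final interface):

package main

import (
    "io/ioutil"
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
)

func main() {
    // "graph.pb" is a placeholder path for a serialized GraphDef.
    model, err := ioutil.ReadFile("graph.pb")
    if err != nil {
        log.Fatal(err)
    }

    graph := tf.NewGraph()
    // The device is fixed at import time, so a session created from this
    // graph never observes a later re-binding.
    opts := tf.GraphImportOptions{
        Prefix: "",
        Device: "/device:GPU:1",
    }
    if err := graph.ImportWithOptions(model, opts); err != nil {
        log.Fatalf("import failed: %v", err)
    }

    sess, err := tf.NewSession(graph, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer sess.Close()
}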

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 16, 2018
@asimshankar
Contributor

@bioothod: You say that "So, basically, the TF C/C++/Go bindings are currently broken: all graphs are always executed on the CPU," but this is not true :). One can certainly execute graphs on GPUs from Go, as demonstrated in the gist I previously linked to, and also in your example (https://github.com/bioothod/golang_gpu_example) when I change the placeholders to tf.float32 instead of tf.int32.
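
For what it's worth, the float32 behavior can also be reproduced without regenerating the .pb, by building the same tiny graph directly with the Go op package (an illustrative sketch; shapes follow your example, and placement can be checked with nvidia-smi or placement logging):

package main

import (
    "fmt"
    "log"

    tf "github.com/tensorflow/tensorflow/tensorflow/go"
    "github.com/tensorflow/tensorflow/tensorflow/go/op"
)

func main() {
    // Same structure as make_graph.py, but with float32 placeholders, for
    // which MatMul does have a GPU kernel.
    s := op.NewScope()
    ph0 := op.Placeholder(s.SubScope("input"), tf.Float, op.PlaceholderShape(tf.MakeShape(3, 1)))
    ph1 := op.Placeholder(s.SubScope("input"), tf.Float, op.PlaceholderShape(tf.MakeShape(1, 3)))
    prod := op.MatMul(s.SubScope("output"), ph0, ph1)

    graph, err := s.Finalize()
    if err != nil {
        log.Fatal(err)
    }
    sess, err := tf.NewSession(graph, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer sess.Close()

    t0, _ := tf.NewTensor([][]float32{{1}, {2}, {3}})
    t1, _ := tf.NewTensor([][]float32{{3, 4, 5}})
    out, err := sess.Run(
        map[tf.Output]*tf.Tensor{ph0: t0, ph1: t1},
        []tf.Output{prod},
        nil,
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out[0].Value())
}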

Regarding TF_BindToDevice - yes, we could just add it with documentation trying to explain that it may or may not work. But that seems like a bad API, as there is no way for potential callers to know whether or not the call will have an effect. For example, if I write some utility library which invokes TF_BindToDevice, whether or not it has an effect depends on the environment from which the caller invoked my library function. This makes it very hard to reason about.

I'm not opposed to extending TF_GraphImportGraphDef, though I haven't fully thought it through yet. But the fact that you aren't able to use the GPU without this explicit binding (as you said, "We have a graph which, according to the documentation, is supposed to be executed on the GPU, and it is not.") seems like some other issue which we should get to the bottom of.

Can you provide an example of such a graph?

@asimshankar asimshankar added the stat:awaiting response Status - Awaiting response from author label Aug 23, 2018
@tensorflowbutler tensorflowbutler added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Sep 6, 2018
@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this PR still valid? Assigning the stalled label. Please comment to reassure me that this is still being worked on.

@bioothod
Contributor Author

bioothod commented Sep 9, 2018

Sorry for the long delays. I will send an updated patch for review soon; some other things have distracted me from this task, but I haven't given up on it.

@tensorflowbutler tensorflowbutler added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting response Status - Awaiting response from author stale This label marks the issue/pr stale - to be closed automatically if no activity labels Sep 10, 2018
…s to bind newly created graph, drop TF_BindToDevice(), since (re)binding at runtime is being frowned upon upstream
asimshankar
asimshankar previously approved these changes Oct 23, 2018
Contributor

@asimshankar asimshankar left a comment

One more std::string, but otherwise looks good. thanks!

@asimshankar asimshankar added the ready to pull PR ready for merge process label Oct 23, 2018
@tensorflowbutler tensorflowbutler removed ready to pull PR ready for merge process stat:awaiting response Status - Awaiting response from author labels Oct 23, 2018
@bioothod
Contributor Author

Updated patch, but it looks like this reset the merge :)

Sorry. And thank you!

@asimshankar asimshankar added ready to pull PR ready for merge process kokoro:force-run Tests on submitted change labels Oct 23, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Oct 23, 2018
@tensorflow-copybara tensorflow-copybara merged commit 77ff33a into tensorflow:master Oct 24, 2018
tensorflow-copybara pushed a commit that referenced this pull request Oct 24, 2018
@asimshankar
Contributor

I intend to temporarily disable this to resolve #23257, and then re-enable it after a release of the C library that includes this PR has been made.

(That way, "go get" works by default and some code tweaks are needed to pin devices from Go, instead of the other way around, where "go get" fails by default and "git checkout" is needed to make it work.)

@bioothod
Contributor Author

Yeah, I read that issue. I don't really know how to solve this kind of problem, except maybe by manually splitting the C part from the other parts and only merging the golang/python/whatever changes after the C part has been released.

tensorflow-copybara pushed a commit that referenced this pull request Nov 28, 2018
By temporarily disabling the features introduced in
PR #20412

Will re-enable after the C libraries for TensorFlow 1.13 or later
have been released.

PiperOrigin-RevId: 223079148
@asimshankar
Contributor

Once the dust settles on go dep / go mod, perhaps we can utilize that for versioning.

This hasn't happened often enough to be troublesome, so I'm okay with some manual work for now.

@bioothod
Contributor Author

bioothod commented Jan 29, 2019

@asimshankar The default execution device is still disabled in the golang API; should it be uncommented now that 1.13.0-rc0 is out?

@asimshankar
Contributor

@bioothod: It may be best to wait till 1.13.0 final is out instead of relying on the RC?

@bioothod
Contributor Author

bioothod commented Feb 8, 2019

Fair enough, let's wait for the 1.13 release to roll out.

@bioothod
Contributor Author

bioothod commented Apr 3, 2019

@asimshankar Hi, is it time to merge the Go changes which were commented out in 6f09a09?

although the documentation for the C library build still references 1.12

@asimshankar
Contributor

Yeah, I think so. @jhseu @alextp would be happy to help that through if you could send a PR (I'm no longer actively working on TensorFlow). Thanks!

@alextp
Contributor

alextp commented Apr 3, 2019

Yes, happy to review and help merge a PR

@bioothod
Contributor Author

This is it: #27891
I've made it against r1.13, but it should be applicable to master too.
