Add option to run TensorFlow job on the preferred device #77
Sounds good @zaleslaw! I suggest that we go with a better-typed API, maybe using some sort of enum, instead of a hardcoded string like this (e.g. something like tf.withDevice(Device.GPU) or tf.withDevice(Device.GPU, 1), etc.). Anyhow, something we can define together as we go, wdyt?
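As a self-contained sketch of what such enum-typed overloads could look like (all names here are illustrative, not part of the actual TF-Java API):

```java
// Illustrative sketch only: hypothetical enum-typed overloads for selecting a
// device, mirroring the tf.withDevice(Device.GPU) / tf.withDevice(Device.GPU, 1)
// idea from the comment above. None of these names come from TF-Java.
public class WithDeviceSketch {
    enum Device { CPU, GPU }

    // Resolve the enum to a TensorFlow device string, defaulting to index 0.
    static String deviceString(Device device) {
        return deviceString(device, 0);
    }

    // Resolve the enum plus an explicit device index, e.g. "/GPU:1".
    static String deviceString(Device device, int index) {
        return "/" + device.name() + ":" + index;
    }

    public static void main(String[] args) {
        System.out.println(deviceString(Device.GPU));    // /GPU:0
        System.out.println(deviceString(Device.GPU, 1)); // /GPU:1
    }
}
```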
I'm not very familiar with all the possible device id strings, but it sounds good to me.
I'm in favour of an enum, and then an optional device ID number. We'll need it to specify which GPU something runs on; I use shared GPU resources, so I need to avoid contending with other users on different GPUs.
Agree with Adam here: GPU/CPU mode with optional String IDs
Yes, that was my proposal as well. I just want to make sure that we are not missing anything; I don't know whether TensorFlow supports more options than this.
The string also accepts host names and a lot of other things that we'll have problems fitting into enums alone:
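For reference, a fully qualified TensorFlow device name can carry job, replica, and task components in addition to the device type and index, which is the main reason a plain enum would be too narrow:

```
/job:<name>/replica:<id>/task:<id>/device:<type>:<index>
e.g. /job:worker/replica:0/task:1/device:GPU:0
```

Any of the components may be omitted, in which case TensorFlow treats them as unconstrained during placement.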
So probably some kind of builder could be more appropriate, then? e.g.
Drive-by comment: maybe look into the existing Python class? https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/device_spec.py
I'm sending #83 to get the DeviceSpec ported as a start.
Great idea! Happy to help test this with a multi-GPU system (#131). Additional question: is there a command-line option to java to explicitly select which GPU should be used? GPU 0 is selected by default, but my system has 4 GPUs, so I'd like to run 4 java processes at once, with each java process explicitly selecting a different GPU.
As expected, the CUDA_VISIBLE_DEVICES environment variable may be used to select GPU 1, for instance:
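A launcher along these lines could pin each JVM to its own GPU (a sketch; `train.jar` is a placeholder for your TF-Java application jar, not something from this project):

```shell
# Sketch: start one JVM per GPU; CUDA_VISIBLE_DEVICES restricts which
# physical GPU each process can see. Inside each process, the single
# visible GPU is renumbered and appears as /GPU:0.
for gpu in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES="$gpu" java -jar train.jar > "gpu$gpu.log" 2>&1 &
done
wait  # wait for all four training processes to finish
```

Note the renumbering: code inside each process should still address the device as GPU 0, since the other devices are invisible to it.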
@aday00 and @karllessard it looks like we have a working device spec now, but no ability to apply it to our graph or model (we only have a setDevice option in GraphBuilder, and we cannot pass it through the current TF operands, so as a result we have no control over device placement yet). Please correct me if I'm wrong.
Maybe in eager mode it's better to move this to the runner, but for a static graph it looks like we need to define it earlier.
One more problem concerns imported graphs and models: when and how could we bind the device spec? Maybe after import we should have a method on the graph, like addGradients? But in reality we need to rebuild our graph with the new device specs. I'm ready to contribute to this ticket, but I need some advice and tips before starting.
Hey @zaleslaw, we should probably add the ability to set the device spec on the scope, so we can end up doing something like:
If you feel that the syntax for building a device spec can still be improved, feel free to update it, thanks!
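As a self-contained illustration of the builder-style spec being discussed (a mock with hypothetical names, not the actual ported DeviceSpec class), building a spec and resolving it to a device string could look like:

```java
// Illustrative mock of a builder-style device spec, mirroring the shape
// discussed in this thread. Not the real TF-Java DeviceSpec API.
public class DeviceSpecSketch {
    enum DeviceType { CPU, GPU }

    static final class DeviceSpec {
        private final DeviceType type;
        private final int index;

        private DeviceSpec(DeviceType type, int index) {
            this.type = type;
            this.index = index;
        }

        static Builder newBuilder() { return new Builder(); }

        // Render in the device-string format TensorFlow understands.
        @Override public String toString() {
            return "/device:" + type + ":" + index;
        }

        static final class Builder {
            private DeviceType type = DeviceType.CPU;
            private int index = 0;
            Builder deviceType(DeviceType t) { this.type = t; return this; }
            Builder deviceIndex(int i) { this.index = i; return this; }
            DeviceSpec build() { return new DeviceSpec(type, index); }
        }
    }

    public static void main(String[] args) {
        DeviceSpec spec = DeviceSpec.newBuilder()
                .deviceType(DeviceType.GPU)
                .deviceIndex(1)
                .build();
        System.out.println(spec); // /device:GPU:1
    }
}
```

A scope method such as tf.withDevice(spec) would then only need to stamp this string onto every operation built under that scope.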
@karllessard Yep, ok, I will try to do it. I suppose there could be a few approaches: tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...), and an alternative approach where you could choose the device at the operation level, not at the scope level; see below, for the Abs operand for example.
The reason for doing it at the scope level is to limit the number of operation factories; otherwise you will double the number of endpoints we expose.
I agree that 2000 endpoints instead of 1000 is not so good, but the main idea of a per-operation device API is to have very fine and detailed control over where operations are performed.
Could you please explain what the problem is with increasing the number of endpoints? Maybe I'm missing something about performance or the development of this framework. Correct me if I'm wrong, but I see this as a change in the Java endpoint generation code (not 2000 minor changes, but one change in the generation process). My idea is to have two-level granularity (operation and scope). We should keep in mind that scopes are not widely used in our API, including the new API parts in the framework. But if we have it on the operation, it could easily be extended through each API (our API is operand-centric, not scope-centric).
Well, at some point we'll run out of space in the class file.
I'd like to see that. Check what it looks like for MKL, and it's still fine:
Like we discussed during the call today, my main concern with adding a new factory, and therefore duplicating all op endpoints, is the noise it will create in the IDE's auto-complete feature, plus the fact that …
Excited to see this feature develop! Will tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...).nn.avgPool(...) allow different parts of a neural network to be trained on different GPUs simultaneously? May some parts of the neural network be trained on CPU and other parts on GPU? Happy to test this as well, when appropriate.
I'll experiment with different modes. But I agree that it could be useful to run backprop and matrix multiplications on different GPUs for different branches in models like InceptionV3 and so on.
Sounds great @zaleslaw! For me, the models become so large that 1 GPU with 12 GB of GPU RAM is not enough to hold the entire model. Partitioning the model across 2 or more GPUs in a single machine would make larger models work.
System information
Describe the feature and the current behavior/state.
Currently, we can run our code on a GPU only by adding the GPU dependencies to the classpath. But the basic Python API provides the ability to set the preferred device (GPU or CPU) via a device name. The basic option is also available in the low-level builder here.
Will this change the current API? How?
Let's add the function tf.withDevice("/GPU:0") to the Scope class.
Who will benefit from this feature?
Anyone who trains neural networks in distributed mode on different GPU/CPU devices.
Any other info.