Add option to run TensorFlow job on the preferred device #77

Closed
zaleslaw opened this issue Jun 22, 2020 · 25 comments

Comments

@zaleslaw
Contributor

System information

  • TensorFlow version (you are using): 1.15 or 2.x
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
Currently, we can run our code on a GPU only by adding the GPU dependencies to the classpath.
But the basic Python API provides the ability to set a preferred device (GPU or CPU) via a device name.

The basic option is also available on the low-level builder here

Will this change the current API? How?

Let's add the function tf.withDevice("/GPU:0") to the Scope class.
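
For illustration, usage could then look roughly like this (a sketch only; withDevice does not exist yet and the names are illustrative; classes are from org.tensorflow, org.tensorflow.op and org.tensorflow.types):

  try (Graph g = new Graph()) {
    Ops tf = Ops.create(g);
    // Hypothetical: ops built through this child scope would be pinned to the first GPU.
    Ops gpu = tf.withDevice("/GPU:0");
    Operand<TFloat32> x = gpu.constant(new float[] {1f, -2f, 3f});
    Operand<TFloat32> y = gpu.math.abs(x);
  }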

Who will benefit from this feature?
Anyone who trains neural networks in distributed mode on different GPU/CPU devices.

Any Other info.

@karllessard
Collaborator

Sounds good @zaleslaw !

I suggest that we go with a better-typed API, maybe using some sort of enum instead of a hardcoded string like this (e.g. something like tf.withDevice(Device.GPU) or tf.withDevice(Device.GPU, 1), etc.). Anyhow, something we can define together as we go, wdyt?
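
As a rough sketch, the enum variant could map a device type and index to the TensorFlow device string (Device and toDeviceString are hypothetical names, not an agreed API):

  public enum Device {
    CPU, GPU, TPU;

    // e.g. Device.GPU.toDeviceString(1) -> "/device:GPU:1"
    public String toDeviceString(int index) {
      return "/device:" + name() + ":" + index;
    }
  }

so tf.withDevice(Device.GPU, 1) would simply forward "/device:GPU:1" to the underlying builder.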

@zaleslaw
Contributor Author

zaleslaw commented Jun 22, 2020 via email

@Craigacp
Collaborator

I'm in favour of an enum plus an optional device ID number. We'll need it to specify which GPU something runs on; I use shared GPU resources, so I need to avoid contending with other users on different GPUs.

@zaleslaw
Contributor Author

zaleslaw commented Jun 22, 2020 via email

@karllessard
Collaborator

karllessard commented Jun 22, 2020

Yes, that was my proposal as well. I just want to make sure that we are not missing anything: I don't know if TensorFlow supports more options than simply /<device>:<id> in its string format. Can you please double check, @zaleslaw?

@saudet
Contributor

saudet commented Jun 23, 2020

The string also accepts host names and a lot of other things that we'll have trouble fitting into plain enums:
https://www.tensorflow.org/guide/gpu
https://www.tensorflow.org/guide/tpu

@karllessard
Collaborator

So probably some kind of builder could be more appropriate then? e.g. Device.job("localhost").replica(0).task(0).gpu(1)
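
For reference, a fully qualified device string has the form /job:<name>/replica:<id>/task:<id>/device:<type>:<id>, so a minimal sketch of such a builder (all names hypothetical) could be:

  public final class Device {
    private final StringBuilder spec = new StringBuilder();

    public static Device job(String name) {
      Device d = new Device();
      d.spec.append("/job:").append(name);
      return d;
    }

    public Device replica(int id) { spec.append("/replica:").append(id); return this; }
    public Device task(int id)    { spec.append("/task:").append(id);    return this; }
    public Device gpu(int id)     { spec.append("/device:GPU:").append(id); return this; }

    // e.g. Device.job("localhost").replica(0).task(0).gpu(1).toString()
    //   -> "/job:localhost/replica:0/task:0/device:GPU:1"
    @Override
    public String toString() { return spec.toString(); }
  }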

@hthu
Contributor

hthu commented Jul 8, 2020

Drive-by comment.

Maybe look into the existing Python class?

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/device_spec.py

@hthu
Contributor

hthu commented Jul 12, 2020

I'm sending #83 to get the DeviceSpec ported as a start.

@aday00

aday00 commented Oct 15, 2020

Great idea! Happy to help test this with a multi-GPU system (#131).

Additional question: is there a command-line option to java to explicitly select which GPU should be used? GPU 0 is selected by default, but my system has 4 GPUs, so I'd like to run 4 java processes at once, with each java process explicitly selecting a different GPU.

@aday00

aday00 commented Oct 15, 2020

Additional question: is there a command-line option to java to explicitly select which GPU should be used?

As expected, the CUDA_VISIBLE_DEVICES environment variable may be used to select GPU 1, for instance:

# export CUDA_VISIBLE_DEVICES=1; nice java -jar my.jar ...

@zaleslaw
Contributor Author

@aday00 and @karllessard, it looks like we have a working device spec now, but no way to apply it to our graph or model (we have only the setDevice option in the GraphBuilder, but we cannot pass it through the current TF operands, so as a result we have no control over the device yet).

Please correct me if I'm wrong.

@zaleslaw
Contributor Author

Maybe in eager mode it's better to move this to the runner, but for the static graph it looks like we need to pre-define it earlier.

@zaleslaw
Contributor Author

zaleslaw commented Oct 22, 2020

One more problem concerns imported graphs and models: when and how could we bind the device spec? Maybe after import we should have a method on the graph, like addGradients? But in reality we need to rebuild our graph with the new device specs.

I'm ready to contribute to this ticket, but I need some advice and tips before starting.

@karllessard
Collaborator

Hey @zaleslaw, we should probably add the ability to attach the DeviceSpec to the Scope when building an operation, similar to what we do with the operation name.

So we can end up doing something like:

tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...).nn.avgPool(...)

If you feel that the syntax for building a device spec can still be improved, feel free to update it, thanks!
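
A minimal sketch of the Scope side, assuming the device is stored next to the existing op-name and control-dependency state (field and constructor shapes here are illustrative, not the actual Scope internals):

  // Inside Scope (sketch): derive a child scope that carries the device spec.
  public Scope withDevice(DeviceSpec deviceSpec) {
    return new Scope(env, nameScope, controlDependencies, deviceSpec);
  }

  // Each op factory would then let the scope decorate the OperationBuilder,
  // mirroring what applyControlDependencies already does:
  public OperationBuilder apply(OperationBuilder builder) {
    builder = applyControlDependencies(builder);
    if (deviceSpec != null) {
      builder.setDevice(deviceSpec.toString());
    }
    return builder;
  }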

@zaleslaw
Contributor Author

zaleslaw commented Oct 22, 2020

@karllessard Yep, ok, I will try to do it. I suppose there could be a few approaches: tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...) and an alternative approach where you choose the device at the operation level, not at the scope level; see below

  public <T extends TNumber> Abs<T> abs(Operand<T> x, DeviceSpec deviceSpec) {
    return Abs.create(scope, x, deviceSpec);
  }

  public static <T extends TNumber> Abs<T> create(Scope scope, Operand<T> x, DeviceSpec deviceSpec) {
    OperationBuilder opBuilder = scope.env().opBuilder("Abs", scope.makeOpName("Abs"));
    opBuilder.addInput(x.asOutput());
    opBuilder.setDevice(deviceSpec.toString());
    opBuilder = scope.applyControlDependencies(opBuilder);
    return new Abs<T>(opBuilder.build());
  }

for the Abs operand, for example.
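
Usage of that per-op overload would then be something like this (assuming the extra generated endpoint; the DeviceSpec builder methods shown are illustrative):

  Operand<TFloat32> y = tf.math.abs(x,
      DeviceSpec.newBuilder()
          .deviceType(DeviceSpec.DeviceType.GPU)
          .deviceIndex(0)
          .build());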

@karllessard
Collaborator

The reason for doing it at the scope level is to limit the number of operation factories; otherwise we would double up the number of endpoints we have in the Ops classes (where one factory method = one endpoint). So instead of having 1000+ endpoints, we'd now have 2000+ endpoints :)

@zaleslaw
Contributor Author

I agree that 2000 endpoints instead of 1000 is not so good, but the main point of a per-op device API is very fine and detailed control over where operations are performed.
Scope is too high-level and does not give that degree of control.

@zaleslaw
Contributor Author

zaleslaw commented Oct 23, 2020

Could you please explain what the problem is with increasing the number of endpoints? Maybe I'm missing something about performance or the development of this framework?

Correct me if I'm wrong, but I see this as a change to the Java endpoint generation code: not 2000 minor changes, but one change in the generation process.

My idea is to have two levels of granularity (per operation and per scope). Keep in mind that scopes are not widely used in our API, including the new API parts in the framework. But if we have it on the operation, it could easily be extended across each API (our API is operand-centric, not Scope-centric).

@Craigacp
Collaborator

Well at some point we'll run out of space in the class file.

@saudet
Contributor

saudet commented Oct 24, 2020

Well at some point we'll run out of space in the class file.

I'd like to see that. Check what it looks like for MKL; it's still fine:
https://github.com/bytedeco/javacpp-presets/blob/master/mkl/src/gen/java/org/bytedeco/mkl/global/mkl_rt.java

@karllessard
Collaborator

karllessard commented Oct 24, 2020

As we discussed during the call today, my main concern with adding a new factory, and therefore duplicating all op endpoints, is the noise it would create in the IDE's auto-complete feature, plus the fact that Scope was specifically designed to handle this case.

@aday00

aday00 commented Nov 2, 2020

Excited to see this feature develop!
Will tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...).nn.avgPool(...) allow different parts of a neural network to be trained on different GPUs simultaneously?
Could some parts of the neural network be trained on the CPU and other parts on the GPU?
Happy to test this as well, when appropriate.

@zaleslaw
Contributor Author

zaleslaw commented Nov 2, 2020 via email

@aday00

aday00 commented Nov 22, 2020

Sounds great, @zaleslaw! For me, the models become so large that 1 GPU with 12GB of GPU RAM is not enough to hold the entire model. Partitioning the model across 2 or more GPUs in a single machine would make larger models feasible.
