Add option to run TensorFlow job on the preferred device #77

Closed
zaleslaw opened this issue Jun 22, 2020 · 25 comments

Comments

@zaleslaw
Contributor

System information

  • TensorFlow version (you are using): 1.15 or 2.x
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
Currently, we can run our code on a GPU only by adding the GPU dependencies to the classpath.
But the basic Python API provides the ability to set a preferred device (GPU or CPU) via a device name.

The basic option is also available on the low-level builder here

Will this change the current API? How?

Let's add the function tf.withDevice("/GPU:0") to the Scope class.
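
For illustration, usage could then look roughly like this (a sketch only; withDevice does not exist yet and the names are illustrative; classes are from org.tensorflow, org.tensorflow.op and org.tensorflow.types):

  try (Graph g = new Graph()) {
    Ops tf = Ops.create(g);
    // Hypothetical: ops built through this child scope would be pinned to the first GPU.
    Ops gpu = tf.withDevice("/GPU:0");
    Operand<TFloat32> x = gpu.constant(new float[] {1f, -2f, 3f});
    Operand<TFloat32> y = gpu.math.abs(x);
  }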

Who will benefit from this feature?
Anyone who trains neural networks in distributed mode on different GPU/CPU devices.

Any Other info.

@karllessard
Collaborator

Sounds good @zaleslaw !

I suggest that we go with a better-typed API, maybe using some sort of enum instead of a hardcoded string like this (e.g. something like tf.withDevice(Device.GPU) or tf.withDevice(Device.GPU, 1), etc.). Anyhow, something we can define together as we go, wdyt?
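
As a rough sketch, the enum variant could map a device type and index to the TensorFlow device string (Device and toDeviceString are hypothetical names, not an agreed API):

  public enum Device {
    CPU, GPU, TPU;

    // e.g. Device.GPU.toDeviceString(1) -> "/device:GPU:1"
    public String toDeviceString(int index) {
      return "/device:" + name() + ":" + index;
    }
  }

so tf.withDevice(Device.GPU, 1) would simply forward "/device:GPU:1" to the underlying builder.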

@zaleslaw
Contributor Author

zaleslaw commented Jun 22, 2020 via email

@Craigacp
Collaborator

I'm in favour of an enum plus an optional device ID number. We'll need it to specify which GPU something runs on; I use shared GPU resources, so I need to avoid contending with other users on different GPUs.

@zaleslaw
Contributor Author

zaleslaw commented Jun 22, 2020 via email

@karllessard
Collaborator

karllessard commented Jun 22, 2020

Yes, that was my proposal as well. I just want to make sure that we are not missing anything: I don't know if TensorFlow supports more options than simply /<device>:<id> in its string format. Can you please double check, @zaleslaw?

@saudet
Contributor

saudet commented Jun 23, 2020

The string also accepts host names and a lot of other things that we'll have trouble fitting into plain enums:
https://www.tensorflow.org/guide/gpu
https://www.tensorflow.org/guide/tpu

@karllessard
Collaborator

So probably some kind of builder could be more appropriate then? e.g. Device.job("localhost").replica(0).task(0).gpu(1)
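
For reference, a fully qualified device string has the form /job:<name>/replica:<id>/task:<id>/device:<type>:<id>, so a minimal sketch of such a builder (all names hypothetical) could be:

  public final class Device {
    private final StringBuilder spec = new StringBuilder();

    public static Device job(String name) {
      Device d = new Device();
      d.spec.append("/job:").append(name);
      return d;
    }

    public Device replica(int id) { spec.append("/replica:").append(id); return this; }
    public Device task(int id)    { spec.append("/task:").append(id);    return this; }
    public Device gpu(int id)     { spec.append("/device:GPU:").append(id); return this; }

    // e.g. Device.job("localhost").replica(0).task(0).gpu(1).toString()
    //   -> "/job:localhost/replica:0/task:0/device:GPU:1"
    @Override
    public String toString() { return spec.toString(); }
  }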

@hthu
Contributor

hthu commented Jul 8, 2020

Drive-by comment.

Maybe look into the existing Python class?

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/device_spec.py

@hthu
Contributor

hthu commented Jul 12, 2020

I'm sending #83 to get the DeviceSpec ported as a start.

@aday00

aday00 commented Oct 15, 2020

Great idea! Happy to help test this with a multi-GPU system (#131).

Additional question: is there a command-line option to java to explicitly select which GPU should be used? GPU 0 is selected by default, but my system has 4 GPUs, so I'd like to run 4 java processes at once, with each java process explicitly selecting a different GPU.

@aday00

aday00 commented Oct 15, 2020

Additional question: is there a command-line option to java to explicitly select which GPU should be used?

As expected, the CUDA_VISIBLE_DEVICES environment variable may be used to select GPU 1, for instance:

# export CUDA_VISIBLE_DEVICES=1; nice java -jar my.jar ...

@zaleslaw
Contributor Author

@aday00 and @karllessard, it looks like we have a working device spec now, but no way to apply it to our graph or model (we have only the setDevice option in the GraphBuilder, but we cannot pass it through the current TF operands, so as a result we have no control over the device yet).

Please correct me if I'm wrong.

@zaleslaw
Contributor Author

Maybe in eager mode it's better to move this to the runner, but for the static graph it looks like we need to pre-define it earlier.

@zaleslaw
Contributor Author

zaleslaw commented Oct 22, 2020

One more problem concerns imported graphs and models: when and how could we bind the device spec? Maybe after import we should have a method on the graph, like addGradients? But in reality we need to rebuild our graph with the new device specs.

I'm ready to contribute to this ticket, but I need some advice and tips before starting.

@karllessard
Collaborator

Hey @zaleslaw, we should probably add the ability to attach the DeviceSpec to the Scope when building an operation, similar to what we do with the operation name.

So we can end up doing something like:

tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...).nn.avgPool(...)

If you feel that the syntax for building a device spec can still be improved, feel free to update it, thanks!
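
A minimal sketch of the Scope side, assuming the device is stored next to the existing op-name and control-dependency state (field and constructor shapes here are illustrative, not the actual Scope internals):

  // Inside Scope (sketch): derive a child scope that carries the device spec.
  public Scope withDevice(DeviceSpec deviceSpec) {
    return new Scope(env, nameScope, controlDependencies, deviceSpec);
  }

  // Each op factory would then let the scope decorate the OperationBuilder,
  // mirroring what applyControlDependencies already does:
  public OperationBuilder apply(OperationBuilder builder) {
    builder = applyControlDependencies(builder);
    if (deviceSpec != null) {
      builder.setDevice(deviceSpec.toString());
    }
    return builder;
  }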

@zaleslaw
Contributor Author

zaleslaw commented Oct 22, 2020

@karllessard Yep, ok, I will try to do it. I suppose there could be a few approaches: tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...) and an alternative approach where you choose the device at the operation level, not at the scope level; see below

  public <T extends TNumber> Abs<T> abs(Operand<T> x, DeviceSpec deviceSpec) {
    return Abs.create(scope, x, deviceSpec);
  }

  public static <T extends TNumber> Abs<T> create(Scope scope, Operand<T> x, DeviceSpec deviceSpec) {
    OperationBuilder opBuilder = scope.env().opBuilder("Abs", scope.makeOpName("Abs"));
    opBuilder.addInput(x.asOutput());
    opBuilder.setDevice(deviceSpec.toString());
    opBuilder = scope.applyControlDependencies(opBuilder);
    return new Abs<T>(opBuilder.build());
  }

for the Abs operand, for example.
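
Usage of that per-op overload would then be something like this (assuming the extra generated endpoint; the DeviceSpec builder methods shown are illustrative):

  Operand<TFloat32> y = tf.math.abs(x,
      DeviceSpec.newBuilder()
          .deviceType(DeviceSpec.DeviceType.GPU)
          .deviceIndex(0)
          .build());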

@karllessard
Collaborator

The reason for doing it at the scope level is to limit the number of operation factories; otherwise we would double up the number of endpoints we have in the Ops classes (where one factory method = one endpoint). So instead of having 1000+ endpoints, we'd now have 2000+ endpoints :)

@zaleslaw
Contributor Author

I agree that 2000 endpoints instead of 1000 is not so good, but the main point of a per-op device API is very fine and detailed control over where operations are performed.
Scope is too high-level and does not give that degree of control.

@zaleslaw
Contributor Author

zaleslaw commented Oct 23, 2020

Could you please explain what the problem is with increasing the number of endpoints? Maybe I'm missing something about performance or the development of this framework?

Correct me if I'm wrong, but I see this as a change to the Java endpoint generation code: not 2000 minor changes, but one change in the generation process.

My idea is to have two levels of granularity (per operation and per scope). Keep in mind that scopes are not widely used in our API, including the new API parts in the framework. But if we have it on the operation, it could easily be extended across each API (our API is operand-centric, not Scope-centric).

@Craigacp
Collaborator

Well at some point we'll run out of space in the class file.

@saudet
Contributor

saudet commented Oct 24, 2020

Well at some point we'll run out of space in the class file.

I'd like to see that. Check what it looks like for MKL; it's still fine:
https://github.com/bytedeco/javacpp-presets/blob/master/mkl/src/gen/java/org/bytedeco/mkl/global/mkl_rt.java

@karllessard
Collaborator

karllessard commented Oct 24, 2020

As we discussed during the call today, my main concern with adding a new factory, and therefore duplicating all op endpoints, is the noise it would create in the IDE's auto-complete feature, plus the fact that Scope was specifically designed to handle this case.

@aday00

aday00 commented Nov 2, 2020

Excited to see this feature develop!
Will tf.withName("myOp").withDevice(DeviceSpec.newBuilder()...).nn.avgPool(...) allow different parts of a neural network to be trained on different GPUs simultaneously?
Could some parts of the neural network be trained on the CPU and other parts on the GPU?
Happy to test this as well, when appropriate.

@zaleslaw
Contributor Author

zaleslaw commented Nov 2, 2020 via email

@aday00

aday00 commented Nov 22, 2020

Sounds great, @zaleslaw! For me, the models become so large that 1 GPU with 12GB of GPU RAM is not enough to hold the entire model. Partitioning the model across 2 or more GPUs in a single machine would make larger models feasible.
