
Fix duplicate kernel registration errors when using a Pluggable Device with the GPU type #56707

Conversation

PatriceVignola
Contributor

One of the supported scenarios outlined in the Pluggable Device RFC is registering a pluggable device with the GPU type and having it completely override the built-in TensorFlow GPU device. However, since pluggable device kernels are registered at runtime, here is roughly what happens:

  1. TensorFlow is loaded
  2. TensorFlow initializes the GPU device
  3. TensorFlow registers all GPU kernels
  4. TensorFlow loads the pluggable device DLL
  5. TensorFlow overrides the built-in GPU device with the higher priority GPU device from the pluggable device
  6. The pluggable device tries to register its own kernels by using the GPU type
  7. TensorFlow throws an error since GPU kernels with the same priority have already been registered in step 3

Overriding the GPU device is not enough. Since the GPU kernels are statically registered, we need to unregister them as soon as a pluggable device with the same type comes online.
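The failure sequence and the proposed fix can be modeled with a toy registry (illustrative Python only; the real registry lives in TensorFlow's C++ core, and all names below are hypothetical):

```python
class DuplicateKernelError(Exception):
    """Raised when a kernel is registered twice for the same key."""

class KernelRegistry:
    """Toy stand-in for TensorFlow's kernel registry (hypothetical API)."""

    def __init__(self):
        self._kernels = {}  # (device_type, op_name) -> priority

    def register(self, device_type, op_name, priority=0):
        key = (device_type, op_name)
        if self._kernels.get(key) == priority:
            # Step 7 above: a kernel with the same device type and the same
            # priority already exists, so registration fails.
            raise DuplicateKernelError(f"duplicate kernel for {key}")
        self._kernels[key] = priority

    def unregister_device(self, device_type):
        # The fix this PR proposes: drop the built-in kernels as soon as a
        # pluggable device claims the same device type.
        self._kernels = {k: v for k, v in self._kernels.items()
                         if k[0] != device_type}

registry = KernelRegistry()
registry.register("GPU", "MatMul")   # step 3: built-in GPU kernel
registry.unregister_device("GPU")    # proposed fix, run when the plug-in loads
registry.register("GPU", "MatMul")   # step 6: plug-in kernel now succeeds
```

Without the `unregister_device` call between steps 3 and 6, the second `register` call would raise, which is the duplicate-registration error this PR is addressing.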

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Jul 8, 2022
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 8, 2022
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Jul 8, 2022
@gbaned gbaned requested a review from rohan100jain July 8, 2022 06:03
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Jul 8, 2022
@PatriceVignola
Contributor Author

@gbaned @rohan100jain Could we get more eyes on this? This is an issue many users have faced since we released the preview of our plugin, because most people use the tensorflow package, not tensorflow-cpu.

@gbaned gbaned requested a review from cheshire July 18, 2022 11:31
@gbaned
Contributor

gbaned commented Jul 18, 2022

> @gbaned @rohan100jain Could we get more eyes on this? This is an issue many users have faced since we released the preview of our plugin, because most people use the tensorflow package, not tensorflow-cpu.

@PatriceVignola Sure, thank you!

@PatriceVignola PatriceVignola changed the title Fix dupplicate kernel registration errors when using a Pluggable Device with the GPU type Fix duplicate kernel registration errors when using a Pluggable Device with the GPU type Jul 26, 2022
@PatriceVignola
Contributor Author

@gbaned @rohan100jain @cheshire Can we get one more look at this PR? We were hoping to get it in the 2.10 release.

@rjdyk

rjdyk commented Jul 28, 2022

@penpornk To add to Pat's comments above, here's the excerpt from the RFC that we're referring to for overriding the GPU device name:

> Supported scenario: Single PluggableDevice registered as "GPU" device type
> In the case of installing one plugin that registers its PluggableDevice as the "GPU" device type, the default GPUDevice will be overridden by the PluggableDevice when the plugin is loaded. When the user specifies the "GPU" device for ops with tf.device("gpu:0"), the registered PluggableDevice will be selected to run those ops.

@penpornk
Member

Hi @PatriceVignola and @rjdyk,

Sorry for the delay!

The RFC states that there could be more than one device with the same device_type (e.g., "GPU") under the assumption that there will be another key, sub_device_type (see figures here), to distinguish between different devices during kernel lookup, e.g., {device_type="GPU", sub_device_type="DIRECTML"}.

However, the implementation PR sparked concerns about having two keys for kernel lookups (device_type and then device_alias, a rename of sub_device_type). After a long discussion, the conclusion was to keep using just one key (device_type), with the implication that device_type should be unique so that there wouldn't be kernel lookup conflicts. The TensorFlow-Metal plug-in uses device_type=GPU because the tensorflow-macos package doesn't have a GPU device to begin with.
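The difference between the two designs can be sketched with plain dictionaries (an illustration of the lookup keys only, not the actual implementation; kernel names are made up):

```python
# Two-key lookup, as drawn in the RFC figures: sub_device_type disambiguates,
# so kernels from different "GPU" plug-ins can coexist.
two_key_registry = {
    ("GPU", "CUDA"): "cuda_matmul_kernel",
    ("GPU", "DIRECTML"): "dml_matmul_kernel",
}
assert two_key_registry[("GPU", "CUDA")] == "cuda_matmul_kernel"

# One-key lookup, as merged: device_type alone is the key, so a second
# "GPU" registration collides with the first (a plain dict silently clobbers
# it; TensorFlow raises a duplicate-registration error instead).
one_key_registry = {}
one_key_registry["GPU"] = "cuda_matmul_kernel"
one_key_registry["GPU"] = "dml_matmul_kernel"
assert one_key_registry["GPU"] == "dml_matmul_kernel"  # CUDA entry is gone
```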

What this PR proposes (allowing a plug-in to override a native device_type and unregistering all existing kernels) hasn't been covered in previous design reviews. While the change is simple, we still need to consider possible side effects and how others might use PluggableDevice differently. For example:

  • TensorFlow might have code that checks for device_type=GPU for something CUDA/ROCm specific that might not work well with other PluggableDevices that would try to register themselves as "GPU".
  • There might be a PluggableDevice overriding device_type=CPU that still wants to use TF CPU kernels for ops that they don't cover. Unregistering all existing kernels at once won't work for this case. (May have to unregister only kernels that the plug-in also has instead.)
  • etc.
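The second bullet can be illustrated with a hypothetical `selective_unregister` helper: dropping only the ops the plug-in covers keeps the remaining TF CPU kernels as a fallback, whereas unregistering everything at once would not (sketch only, not a real TensorFlow API):

```python
def selective_unregister(native_kernels, plugin_ops):
    """Hypothetical helper: keep native kernels only for ops the plug-in
    does not provide, so they remain available as a fallback."""
    return {op: impl for op, impl in native_kernels.items()
            if op not in plugin_ops}

native_cpu = {"MatMul": "eigen_matmul", "Unique": "eigen_unique"}
plugin_ops = {"MatMul"}  # the plug-in only implements MatMul

remaining = selective_unregister(native_cpu, plugin_ops)
# "Unique" still resolves to the TF CPU kernel; "MatMul" goes to the plug-in.
```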

We will discuss this internally and get back to you.

What is your motivation for setting device_type=GPU (and not something else, e.g., DIRECTML_GPU)? Is it to let users use TF-DirectML without having to change code that explicitly places ops on "GPU"?

@PatriceVignola
Contributor Author

@penpornk Thank you for the details.

Our motivation for using the GPU device type is that we didn't see any drawbacks, and it makes life easier for users by not requiring them to change their scripts. It also allows us to leverage a lot of the existing grappler optimizations (for example, like cuDNN, DML is faster with NCHW than with NHWC). We were using the DML device type in our TF 1.15 fork, and the feedback we got from many users was that being able to use GPU instead would make their lives a lot easier.

Our assumption was that device_type could be GPU and that device_name was essentially the equivalent of sub_device_type, which would allow TensorFlow to differentiate between the built-in GPU devices and other devices internally if needed.

Of course, we can always add tensorflow-cpu as a hard requirement for our plugin, but since many users use the tensorflow package, we were hoping there would be a way to override the GPU device and let users use the plugin as a drop-in replacement without having to install tensorflow-cpu. Because of the -cpu suffix, users are already fairly confused when we tell them that tensorflow-cpu plus our plugin will still use their GPU.

At the very least, if this behavior really should be disallowed, I feel an error should be thrown at device registration time (e.g., "GPU device type is not a valid pluggable device type") instead of when a duplicate kernel is found.
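The fail-fast check suggested here would run when the pluggable device registers itself rather than later during kernel registration. A sketch with hypothetical names (the reserved-type set and the entry point are assumptions, not TensorFlow's actual API):

```python
NATIVE_DEVICE_TYPES = {"CPU", "GPU"}  # assumption: types TF registers itself

def register_pluggable_device(device_type):
    """Hypothetical registration entry point with an early validity check."""
    if device_type in NATIVE_DEVICE_TYPES:
        # Fail fast with an actionable message instead of surfacing a
        # confusing duplicate-kernel error much later.
        raise ValueError(
            f"{device_type} device type is not a valid pluggable device type")
    return f"registered pluggable device {device_type}"

msg = register_pluggable_device("DML")  # a non-native type registers fine
```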

rjdyk added a commit to rjdyk/windows-ai-docs that referenced this pull request Aug 1, 2022
In response to the behavior of `tensorflow-directml-plugin` not overriding the `GPU` device name for the `tensorflow` package (tensorflow/tensorflow#56707), we are setting `tensorflow-cpu` as a hard requirement for the time being.
@gbaned gbaned requested review from penpornk and removed request for penpornk August 9, 2022 14:24
@gbaned
Contributor

gbaned commented Dec 29, 2022

Hi @penpornk Can you please review this PR? Thank you!

11 similar comments from @gbaned (Mar 21, 2023 to Mar 8, 2024)

@mihaimaruseac
Collaborator

Closing stale PR

PR Queue automation moved this from Assigned Reviewer to Closed/Rejected Mar 25, 2024
@google-ml-butler google-ml-butler bot removed the awaiting review Pull request awaiting review label Mar 25, 2024