
Fix duplicate kernel registration errors when using a Pluggable Device with the GPU type #56707

Conversation

PatriceVignola
Contributor

One of the supported scenarios outlined in the Pluggable Device RFC is registering a pluggable device with the GPU type and having it completely override the built-in TensorFlow GPU device. However, since pluggable device kernels are registered at runtime, here is roughly what happens:

  1. TensorFlow is loaded
  2. TensorFlow initializes the GPU device
  3. TensorFlow registers all GPU kernels
  4. TensorFlow loads the pluggable device DLL
  5. TensorFlow overrides the built-in GPU device with the higher priority GPU device from the pluggable device
  6. The pluggable device tries to register its own kernels by using the GPU type
  7. TensorFlow throws an error since GPU kernels with the same priority have already been registered in step 3

Overriding the GPU device is not enough. Since the GPU kernels are statically registered, we need to unregister them as soon as a pluggable device with the same type comes online.
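The failure sequence and the proposed fix can be modeled with a toy registry (illustrative Python only; the real registry lives in TensorFlow's C++ core, and all names below are hypothetical):

```python
class DuplicateKernelError(Exception):
    """Raised when a kernel is registered twice for the same key."""

class KernelRegistry:
    """Toy stand-in for TensorFlow's kernel registry (hypothetical API)."""

    def __init__(self):
        self._kernels = {}  # (device_type, op_name) -> priority

    def register(self, device_type, op_name, priority=0):
        key = (device_type, op_name)
        if self._kernels.get(key) == priority:
            # Step 7 above: a kernel with the same device type and the same
            # priority already exists, so registration fails.
            raise DuplicateKernelError(f"duplicate kernel for {key}")
        self._kernels[key] = priority

    def unregister_device(self, device_type):
        # The fix this PR proposes: drop the built-in kernels as soon as a
        # pluggable device claims the same device type.
        self._kernels = {k: v for k, v in self._kernels.items()
                         if k[0] != device_type}

registry = KernelRegistry()
registry.register("GPU", "MatMul")   # step 3: built-in GPU kernel
registry.unregister_device("GPU")    # proposed fix, run when the plug-in loads
registry.register("GPU", "MatMul")   # step 6: plug-in kernel now succeeds
```

Without the `unregister_device` call between steps 3 and 6, the second `register` call would raise, which is the duplicate-registration error this PR is addressing.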

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Jul 8, 2022
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 8, 2022
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Jul 8, 2022
@gbaned gbaned requested a review from rohan100jain July 8, 2022 06:03
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Jul 8, 2022
@PatriceVignola
Contributor Author

@gbaned @rohan100jain Could we get more eyes on this? This is an issue many users have faced since we released the preview of our plugin, because most people use the tensorflow package, not tensorflow-cpu.

@gbaned gbaned requested a review from cheshire July 18, 2022 11:31
@gbaned
Contributor

gbaned commented Jul 18, 2022

> @gbaned @rohan100jain Could we get more eyes on this? This is an issue many users have faced since we released the preview of our plugin, because most people use the tensorflow package, not tensorflow-cpu.

@PatriceVignola Sure, thank you!

@PatriceVignola PatriceVignola changed the title Fix dupplicate kernel registration errors when using a Pluggable Device with the GPU type Fix duplicate kernel registration errors when using a Pluggable Device with the GPU type Jul 26, 2022
@PatriceVignola
Contributor Author

@gbaned @rohan100jain @cheshire Can we get one more look at this PR? We were hoping to get it in the 2.10 release.

@rjdyk

rjdyk commented Jul 28, 2022

@penpornk To add to Pat's comments above, here's the excerpt from the RFC that we're referring to for overriding the GPU device name:

> Supported scenario: Single PluggableDevice registered as "GPU" device type
> In the case of installing one plugin that registers its PluggableDevice as the "GPU" device type, the default GPUDevice will be overridden by the PluggableDevice when the plugin is loaded. When the user specifies the "GPU" device for ops with tf.device("gpu:0"), the registered PluggableDevice will be selected to run those ops.

@penpornk
Member

Hi @PatriceVignola and @rjdyk,

Sorry for the delay!

The RFC states that there could be more than one device with the same device_type (e.g., "GPU") under the assumption that there will be another key, sub_device_type (see figures here), to distinguish between different devices during kernel lookup, e.g., {device_type="GPU", sub_device_type="DIRECTML"}.

However, the implementation PR sparked concerns about having two keys for kernel lookups (device_type and then device_alias, a rename of sub_device_type). After a long discussion, the conclusion was to keep using just one key (device_type), with the implication that device_type should be unique so that there wouldn't be kernel lookup conflicts. The TensorFlow-Metal plug-in uses device_type=GPU because the tensorflow-macos package doesn't have a GPU device to begin with.
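The difference between the two designs can be sketched with plain dictionaries (an illustration of the lookup keys only, not the actual implementation; kernel names are made up):

```python
# Two-key lookup, as drawn in the RFC figures: sub_device_type disambiguates,
# so kernels from different "GPU" plug-ins can coexist.
two_key_registry = {
    ("GPU", "CUDA"): "cuda_matmul_kernel",
    ("GPU", "DIRECTML"): "dml_matmul_kernel",
}
assert two_key_registry[("GPU", "CUDA")] == "cuda_matmul_kernel"

# One-key lookup, as merged: device_type alone is the key, so a second
# "GPU" registration collides with the first (a plain dict silently clobbers
# it; TensorFlow raises a duplicate-registration error instead).
one_key_registry = {}
one_key_registry["GPU"] = "cuda_matmul_kernel"
one_key_registry["GPU"] = "dml_matmul_kernel"
assert one_key_registry["GPU"] == "dml_matmul_kernel"  # CUDA entry is gone
```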

What this PR proposes (allowing a plug-in to override a native device_type and unregistering all existing kernels) hasn't been covered in previous design reviews. While the change is simple, we still need to consider possible side effects and how others might use PluggableDevice differently. For example:

  • TensorFlow might have code that checks for device_type=GPU for something CUDA/ROCm specific that might not work well with other PluggableDevices that would try to register themselves as "GPU".
  • There might be a PluggableDevice overriding device_type=CPU that still wants to use TF CPU kernels for ops that they don't cover. Unregistering all existing kernels at once won't work for this case. (May have to unregister only kernels that the plug-in also has instead.)
  • etc.
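The second bullet can be illustrated with a hypothetical `selective_unregister` helper: dropping only the ops the plug-in covers keeps the remaining TF CPU kernels as a fallback, whereas unregistering everything at once would not (sketch only, not a real TensorFlow API):

```python
def selective_unregister(native_kernels, plugin_ops):
    """Hypothetical helper: keep native kernels only for ops the plug-in
    does not provide, so they remain available as a fallback."""
    return {op: impl for op, impl in native_kernels.items()
            if op not in plugin_ops}

native_cpu = {"MatMul": "eigen_matmul", "Unique": "eigen_unique"}
plugin_ops = {"MatMul"}  # the plug-in only implements MatMul

remaining = selective_unregister(native_cpu, plugin_ops)
# "Unique" still resolves to the TF CPU kernel; "MatMul" goes to the plug-in.
```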

We will discuss this internally and get back to you.

What is your motivation for setting device_type=GPU (and not something else, e.g., DIRECTML_GPU)? Is it to let users use TF-DirectML without having to change code that explicitly places ops on "GPU"?

@PatriceVignola
Contributor Author

@penpornk Thank you for the details.

Our motivation for using the GPU device type is that we didn't see any drawbacks, and it makes life easier for users by not requiring them to change their scripts. It also allows us to leverage a lot of the existing grappler optimizations (for example, like cuDNN, DML is faster with NCHW than with NHWC). We were using the DML device type in our TF 1.15 fork, and the feedback we got from many users was that being able to use GPU instead would make their lives a lot easier.

Our assumption was that device_type could be GPU and that device_name was essentially the equivalent of sub_device_type, which would allow TensorFlow to differentiate between the built-in GPU devices and other devices internally if needed.

Of course, we can always add tensorflow-cpu as a hard requirement for our plugin, but since many users use the tensorflow package, we were hoping there would be a way to override the GPU device and let users use the plugin as a drop-in replacement without having to install tensorflow-cpu. Because of the -cpu suffix, users are already fairly confused when we tell them that tensorflow-cpu plus our plugin will still use their GPU.

At the very least, if this behavior really should be disallowed, I feel an error should be thrown at device registration time (e.g., "GPU device type is not a valid pluggable device type") instead of when a duplicate kernel is found.
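The fail-fast check suggested here would run when the pluggable device registers itself rather than later during kernel registration. A sketch with hypothetical names (the reserved-type set and the entry point are assumptions, not TensorFlow's actual API):

```python
NATIVE_DEVICE_TYPES = {"CPU", "GPU"}  # assumption: types TF registers itself

def register_pluggable_device(device_type):
    """Hypothetical registration entry point with an early validity check."""
    if device_type in NATIVE_DEVICE_TYPES:
        # Fail fast with an actionable message instead of surfacing a
        # confusing duplicate-kernel error much later.
        raise ValueError(
            f"{device_type} device type is not a valid pluggable device type")
    return f"registered pluggable device {device_type}"

msg = register_pluggable_device("DML")  # a non-native type registers fine
```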

rjdyk added a commit to rjdyk/windows-ai-docs that referenced this pull request Aug 1, 2022
In response to the behavior of `tensorflow-directml-plugin` not overriding the `GPU` device name for the `tensorflow` package (tensorflow/tensorflow#56707), we are setting `tensorflow-cpu` as a hard requirement for the time being.
@gbaned gbaned requested review from penpornk and removed request for penpornk August 9, 2022 14:24
@gbaned
Contributor

gbaned commented Dec 29, 2022

Hi @penpornk Can you please review this PR? Thank you!

11 similar comments from @gbaned (Mar 21, 2023 to Mar 8, 2024)

@mihaimaruseac
Collaborator

Closing stale PR

PR Queue automation moved this from Assigned Reviewer to Closed/Rejected Mar 25, 2024
@google-ml-butler google-ml-butler bot removed the awaiting review Pull request awaiting review label Mar 25, 2024