GPU Device Selector in TensorFlow 2.0 #26460
Duplicate of #25446
@jaingaurav: The new `tf.config.experimental.set_visible_devices` API is a replacement for setting `visible_device_list`.
@guptapriya: It is on my radar, but I still need to sync up with @alextp on the potential issues.
A number of new APIs were added under `tf.config.experimental`.
Thanks for the update, @jaingaurav! I have tried the new functionality via `tf.config.experimental`, and have two pieces of feedback. First, the API requires one to list the physical devices before selecting which are visible. During the list operation, TensorFlow creates a GPU context on every GPU, including ones that we're not planning to use. You can see how this is wasteful if we run 8 TensorFlow processes on an 8-GPU server, each taking up ~120MB of GPU memory, totaling almost 1GB of wasted GPU memory. Could you add a way to set visible devices w/o binding GPU contexts? Second, I noticed a regression in our legacy usage of `visible_device_list`.
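The waste estimate above can be checked with a little arithmetic (the ~120MB-per-context figure is taken from the comment, not measured here):

```python
# Rough cost of the eager device-listing behavior described above.
# Assumption: each process's CUDA context holds ~120 MB per GPU.
context_mb = 120   # approximate memory held by one CUDA context
processes = 8      # one TensorFlow process per GPU on the server

# Every process binds a context on every GPU, so on any single GPU
# the contexts from all 8 processes add up:
per_gpu_mb = context_mb * processes
print(per_gpu_mb)  # 960, i.e. "almost 1GB of wasted GPU memory"
```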
@alsrgv: Are you sure the listing operation is causing the GPU memory to be used? The listing API was supposed to be a lightweight operation that would not involve any memory allocation. Did you experience this with the 1.x or the 2.0 nightly? Regarding the bug you mentioned with `visible_device_list`, I'm taking a look.
@jaingaurav, thanks for the quick response. I did experience it in the nightly. The way I verify GPU memory usage is via `nvidia-smi`.
Thanks for the details. I have a fix for the regression that I am getting through code review. I was able to reproduce the GPU memory allocation issue locally and am looking into it now.
@alsrgv: The fix for the `visible_device_list` regression has been submitted. I am working on the GPU memory allocation issue. However, the fix requires quite a bit of code re-structuring: it seems we end up allocating memory as a side effect of querying CUDA capabilities.
@jaingaurav, thanks for the fix! However, I still see an error with it. Looking forward to the memory usage fix. The memory usage could be caused by the creation of a CUDA context; if that's the case, the driver API should allow querying device capabilities w/o a CUDA context (and the associated memory usage).
@alsrgv: Thanks for clarifying the previous behavior. I'll ensure that I maintain compatibility. I've got the memory issue almost fixed; one last change is needed. In terms of releases, I will ensure the regression fix is cherry-picked into 1.14. However, for the memory issue, can it wait till 1.15 & 2.0, or would you like it for 1.14 as well? It just depends on what you'd like to support.
@jaingaurav, looking forward to the fixes! It would be great if the memory issue fix could be picked into 1.14 as well. 1.13 had a memory issue with XLA (it was binding memory on all devices as well), and it was causing out-of-memory issues with cuDNN. So there has been no release w/o memory issues since 1.12.
@jaingaurav, yes, that's it - hence the request to pick the fix for this memory issue into the 1.14 branch as well.
@alsrgv: All known issues should be fixed in tonight's nightly. Once you confirm the behavior, I will speak to the release team about trying to get the memory fixes into 1.14. The changes weren't too bad, but they do incur some risk to get into the release.
@jaingaurav, thanks for the fixes! The memory leak with `tf.config.experimental.list_physical_devices` is gone. I'm still getting an error with one of the new APIs, though.
I tried this on a CPU build with the same outcome.
Thanks @alsrgv. From the looks of it, that nightly build might not have the latest change yet. We can re-verify in the next build.
@jaingaurav, I just built master from source and can confirm it works, thanks!
@ppwwyyxx: @jaingaurav thanks a lot for this improvement! I can verify that it also fixes another old issue at #8136 (comment).

```python
import tensorflow as tf
print(tf.config.experimental.list_physical_devices('GPU'))
cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = '1'
sess = tf.Session(config=cfg)  # does not fail
```

The above does not fail when running on a 2-GPU machine, but it would fail on earlier builds. Any chance we can have this great improvement in 1.14?
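For illustration only, the index-selection semantics of `visible_device_list = '1'` can be emulated with a small plain-Python helper (`select_visible` is hypothetical, not a TensorFlow API):

```python
def select_visible(devices, spec):
    """Emulate gpu_options.visible_device_list: pick devices by a
    comma-separated index string such as '1' or '0,2'."""
    indices = [int(i) for i in spec.split(",")]
    return [devices[i] for i in indices]

physical = ["/physical_device:GPU:0", "/physical_device:GPU:1"]
# visible_device_list = '1' exposes only the second physical GPU:
print(select_visible(physical, "1"))  # ['/physical_device:GPU:1']
```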
@ppwwyyxx: This is exactly why the new API was created. Unfortunately, any changes to the old `ConfigProto` path are unlikely at this point.
@jaingaurav, any news whether these fixes can be picked into r1.14? We'd really like a release since 1.12.x that has correct GPU memory binding behavior. cc @martinwicke
Cherry-picking these bug fixes makes sense, I think, if the release is still not too advanced.
@alsrgv: We had a chat about it this morning. Given the status of 1.14, we're going to aim to cherry-pick the memory fixes. We'd greatly appreciate any testing you can do to help ensure that we don't incur any regressions and that we have everything you need in 1.14. Thank you for all that you have done so far!
@jaingaurav, perfect, thanks! I will test 1.14 RCs as they come out.
@llan-ml: @jaingaurav Hi, I tried to use the device selector with TF 2, but there are still some problems:

```python
In [1]: import tensorflow as tf

In [2]: tf.__version__
Out[2]: '2.0.0-dev20190606'

In [3]: gpus = tf.config.experimental.list_physical_devices("GPU")

In [4]: gpus
Out[4]:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

In [5]: tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

In [6]: tf.config.experimental.set_memory_growth(gpus[0], True)

In [7]: tf.constant(1)
...
context.py in _compute_gpu_options(self)
    851       memory_growths = set(self._memory_growth_map.values())
    852       if len(memory_growths) > 1:
--> 853         raise ValueError("Memory growth cannot differ between GPU devices")
    854       allow_growth = memory_growths.pop()
    855     else:

ValueError: Memory growth cannot differ between GPU devices
```
@llan-ml: Please see the updated guide at https://www.tensorflow.org/beta/guide/using_gpu#limiting_gpu_memory_growth. Currently we require the memory growth option to be uniform across all GPUs. This may change in the future if someone implements the necessary changes.
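The uniformity requirement is visible in the traceback above; a simplified re-statement of that check (paraphrased from the `_compute_gpu_options` snippet, not the actual TensorFlow source) is:

```python
def compute_allow_growth(memory_growth_map):
    """Simplified version of the check in _compute_gpu_options:
    every GPU in the map must share one memory-growth setting."""
    memory_growths = set(memory_growth_map.values())
    if len(memory_growths) > 1:
        raise ValueError("Memory growth cannot differ between GPU devices")
    return memory_growths.pop() if memory_growths else None

# Uniform settings are accepted:
print(compute_allow_growth({"GPU:0": True, "GPU:1": True}))  # True

# Mixed settings raise, even if one of the GPUs was made invisible:
try:
    compute_allow_growth({"GPU:0": True, "GPU:1": False})
except ValueError as e:
    print(e)  # Memory growth cannot differ between GPU devices
```

This is why setting growth on only one of two physical GPUs failed in the session above: the map still tracks the hidden GPU's (default) setting.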
@jaingaurav What I mean is that after I have selected a single GPU by calling `tf.config.experimental.set_visible_devices`, setting memory growth on just that GPU still fails with the error above. For now, I still have to select a specific GPU by setting `CUDA_VISIBLE_DEVICES`.
This was discovered in Issue #26460. PiperOrigin-RevId: 253082055
@llan-ml: Thanks, that is indeed a valid bug. The fix has been pushed now and I'll ensure it makes it into the upcoming 1.14 release.
@jaingaurav I am seeing a lot of memory issues when using depthwise conv2d native. It seems it does not respect the session config's visible device list and grabs all the GPUs. You mentioned this above - "I am working on the GPU memory allocation issue. However, the fix requires quite a bit of code re-structuring. It seems we end up allocating memory as a function of querying CUDA capabilities." I am wondering if these fixes made their way into TF 1.14. Can you point me to any PRs with these fixes? Thanks!
@pidajay: How are you querying the GPUs? Could you share a code snippet? Note this issue was primarily focused on querying GPUs with eager execution. If you are using sessions and experiencing issues, there might be something else going on. Here is a tutorial on how to use the new APIs: https://www.tensorflow.org/beta/guide/using_gpu
@jaingaurav Appreciate the response. I am using an estimator (a TPU estimator, but running on GPU). A single GPU is fine, but the problem shows up when distributing across multiple GPUs (I use Horovod, but Horovod does not seem to be the issue here). I create the visible device list at the beginning of the program.
But I notice that the estimator violates the visible device list above and allocates all the GPUs. And this happens only when using depthwise conv2d.
Hi @jaingaurav It seems that disabling all GPUs does not work properly. I ran code that sets the visible devices to an empty list, but the GPUs still appear to be initialized.
What if users want 2 specific GPUs out of 4? How can just those 2 be enabled?
As the documentation says, you can give `set_visible_devices` a list of devices.
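A sketch of that pattern, assuming 4 physical GPUs and the `tf.config.experimental` API shown earlier in the thread (the list manipulation itself is ordinary Python, shown runnable here with stand-in device names):

```python
# With TensorFlow, the calls would be:
#   gpus = tf.config.experimental.list_physical_devices('GPU')
#   tf.config.experimental.set_visible_devices(gpus[1:3], 'GPU')
# Picking, say, GPUs 1 and 2 out of 4 is just list indexing:
gpus = [f"/physical_device:GPU:{i}" for i in range(4)]
chosen = [gpus[1], gpus[2]]
print(chosen)  # ['/physical_device:GPU:1', '/physical_device:GPU:2']
```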
Please make sure that this is a feature request. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
Describe the feature and the current behavior/state.
TensorFlow 1.x supports specifying which GPU devices to use via `gpu_options.visible_device_list` in the session config. There's no comparable API in TensorFlow 2.0. The closest option is the `CUDA_VISIBLE_DEVICES` environment variable. Unfortunately, `CUDA_VISIBLE_DEVICES` prevents processes from doing `cudaMemcpy` from/to devices not owned by the process, and there's a significant performance degradation when NCCL is used with P2P communication disabled. The ask is to add an API to TensorFlow 2.0 to enable device selection.
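For contrast, the environment-variable workaround mentioned above looks like this (it must be set before TensorFlow/CUDA initialize; the index values are illustrative):

```python
import os

# Restrict this process to GPUs 0 and 1 before TensorFlow starts up.
# Hidden devices become invisible to the CUDA runtime, which is what
# blocks cudaMemcpy peer access and forces NCCL to disable P2P -
# exactly the drawback described above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0,1
```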
Will this change the current API? How?
Yes, it will introduce an API to select which GPU devices to use.
Who will benefit from this feature?
Users of Horovod.
Any other info.
cc @azaks2 @alextp @jaingaurav @guptapriya