OSError: [Errno 9] Bad file descriptor raised on program exit #50487
Comments
can confirm this in tf2.5.0 from pypi |
@crm416 , Can you please try to execute the code in tf v2.5 and let us know if you are facing same issue? Thanks! |
@tilakrayal - Yes, this only occurs for me in tf v2.5 (and not in tf v2.3 or tf v2.4). |
Same issue in tf v2.6. The following code causes
with the following output:
Whereas the one below is fine
Also tested the same code snippet with tf v2.4 and it ran fine in both cases. |
I see a similar error when running the recommendation model from the models repo on TF2.5.0 and later on python3.8
|
The other interesting thing is this only happens (for me at least) on py38 and py39. It runs just fine on py37, so maybe this is a python bug. Perhaps this one? |
I tried changing the MirroredStrategy to OneDeviceStrategy and the exception went away. So, not sure if it is an issue caused by both combination of python and TF problems. |
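The swap described above can be sketched as a small factory that picks the strategy by name. This is only an illustration: `make_strategy`, its `name` values, and the default device string `"/gpu:0"` are hypothetical choices, not something taken from the models repo. TensorFlow is imported lazily so the name check runs even where TensorFlow is not installed.

```python
def make_strategy(name="one_device", device="/gpu:0"):
    """Return a tf.distribute strategy chosen by name (hypothetical helper)."""
    if name not in ("mirrored", "one_device"):
        raise ValueError(f"unknown strategy: {name!r}")

    # Lazy import: only needed once a valid strategy name was requested.
    import tensorflow as tf

    if name == "mirrored":
        # Replicates across all visible GPUs; the setup reported to
        # trigger the exception at exit in this thread.
        return tf.distribute.MirroredStrategy()

    # Places everything on a single device; reported above to avoid the error.
    return tf.distribute.OneDeviceStrategy(device=device)
```

Model-building code would then run inside `with make_strategy("one_device").scope():` exactly as it did with `MirroredStrategy`.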
This happens in TF 2.7 too, with Python 3.9. I think it's because MirroredStrategy creates a multiprocessing ThreadPool but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown. You can explicitly close the pool on exit using:
Which should prevent the error for now (until there is a fix). |
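A minimal sketch of that kind of workaround, assuming the pool is reachable through private attributes of the strategy object. The attribute path `_extended._collective_ops._pool` used in the usage note is an assumption about TensorFlow internals, which change between versions (a commit referencing this issue reports `'CollectiveAllReduce' object has no attribute '_pool'`), so the lookup is defensive. `find_attr` and `close_pool_at_exit` are hypothetical helper names.

```python
import atexit


def find_attr(root, dotted_path):
    """Follow a dotted attribute path; return None if any step is missing."""
    obj = root
    for name in dotted_path.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


def close_pool_at_exit(root, dotted_path):
    """Register pool.close() to run at interpreter exit.

    Returns True if a pool was found and registered, False otherwise.
    """
    pool = find_attr(root, dotted_path)
    if pool is None:
        return False
    # atexit handlers run before module teardown, so the pool is closed
    # while its file descriptors are still valid.
    atexit.register(pool.close)
    return True
```

With a strategy in hand, the call would look like `close_pool_at_exit(strategy, "_extended._collective_ops._pool")`. A `False` return means the assumed private attribute does not exist in the installed TensorFlow version and nothing was registered.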
This works for me, thank you! |
For me in TF 2.5.0 the problem is hardware-dependent. |
- fixes lack of multiprocess thread pool teardown in TF Mirrored strategy as stated in tensorflow/tensorflow#50487 Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@npanpaliya I'm training BERT using run_pretraining.py
<https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py>
and got the Bad file descriptor error. I then followed your post and changed the python3.8/multiprocessing/pool.py file at the place where the error shows up (see the picture below).
[[image: error.png]](https://github.com/suchunxie/Random_forest/blob/main/error.png)
(My environment is Ubuntu + Docker + the nvidia-tensorflow container.)
|
@suchunxie - You can specify strategy here https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py#L207. "one_device" is supported https://github.com/suchunxie/models/blob/65e571fdc903873362e59abe0aeec5c8018da750/official/common/distribute_utils.py#L158. |
Hi @npanpaliya, it works! I had tried this before and it didn't work; after you pointed it out I checked again and found a backslash was missing before --distribution_strategy. Stupid me > <. |
Hi @suchunxie, This is great! Glad to hear this! :) |
It seems that a fix has been submitted in #56279 (comment), and users need to wait for the tf 2.10 release. |
Hi @charliermarsh, it looks like the issue is resolved in the stable TensorFlow 2.9 release.
|
Not for me. Using TensorFlow 2.9.1, when exiting the interpreter, it shows the exception:
In [1]: import tensorflow
   ...: def f():
   ...:     strategy = tensorflow.distribute.MirroredStrategy()
   ...:     with strategy.scope():
   ...:         tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
   ...:             tensorflow.keras.layers.Input(shape=(88, 88, 3))
   ...:         )
   ...: f()
2022-07-29 12:54:45.169943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 12:54:47.305006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 429 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5
2022-07-29 12:54:47.305948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9651 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5
2022-07-29 12:54:47.306459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 427 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5
2022-07-29 12:54:47.306939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 429 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b4:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
In [2]:
Do you really want to exit ([y]/n)?
Exception ignored in: <function Pool.__del__ at 0x7ff160d75c10>
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor |
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Hi @ZJaume, Could you share the system configuration, I am not able to replicate the issue. Thank you! |
Hi, sorry for the inconvenience but now I've tried with a fresh new virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that is throwing the exception has had many different tensorflow versions from In case you want to reproduce it my versions are: And the output of
|
Closing as stale. Please reopen if you'd like to work on this further. |
…ersion (#4522) - fixes the issue with the latest TensorFlow version and YOLO example that results in `AttributeError: 'CollectiveAllReduce' object has no attribute '_pool'`. The issue comes from the workaround for tensorflow/tensorflow#50487 Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown (tensorflow/tensorflow#50487 (comment))
We are currently at 2.9.1 which runs into some cosmetic errors (#25142, tensorflow/tensorflow#50487 (comment)). This PR upgrades to the latest Tensorflow release. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish <avnishnarayan@gmail.com>
System information
- TensorFlow version: v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
- Python version: Python 3.8.5
- CUDA/cuDNN version: 11.2 / 8.1.0.77-1
Describe the current behavior
When using `MirroredStrategy` as a context manager, Python raises an ignored exception (`OSError: [Errno 9] Bad file descriptor`) on program exit.
Describe the expected behavior
Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)
Contributing
Standalone code to reproduce the issue
Removing the `strategy.scope()` causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of `def f()` and `f()`, and invoking at the top level).
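For reference, a reproducer in script form might look like the sketch below. It is adapted from the TensorFlow 2.9.1 comment in this thread, not recovered from the original report, so the layer shapes and the function name `build_under_mirrored_strategy` are illustrative. TensorFlow is imported inside the function so that the definition itself loads without it.

```python
def build_under_mirrored_strategy():
    """Create a Keras layer inside a MirroredStrategy scope, within a function."""
    import tensorflow as tf  # lazy import; requires TensorFlow to be installed

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        inputs = tf.keras.layers.Input(shape=(88, 88, 3))
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(inputs)
    # `strategy` goes out of scope here; per the discussion in this thread,
    # its thread pool is not explicitly closed before interpreter shutdown.
```

Calling `build_under_mirrored_strategy()` as the last statement of a script and letting Python exit is the situation described above. Whether the ignored exception actually appears depends on the TensorFlow and Python versions and, according to some comments, on the hardware and environment.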