OSError: [Errno 9] Bad file descriptor raised on program exit #50487

charliermarsh · 2021-06-28T15:00:02Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
Python version: Python 3.8.5
CUDA/cuDNN version: 11.2 / 8.1.0.77-1
GPU model and memory: P100

Describe the current behavior

When using MirroredStrategy as a context manager, Python raises an ignored exception on program exit:

Exception ignored in: <function Pool.__del__ at 0x7f21f942e4c0>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
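
For readers unsure what "Exception ignored in" implies for the process itself, here is a minimal sketch, not taken from TensorFlow. It assumes a stand-in class (the hypothetical `Leaky`) whose finalizer raises the same `OSError` as the traceback above, and runs it in a child interpreter to check that the failure is reported on stderr without changing the exit code.

```python
import subprocess
import sys

# Child program: a hypothetical stand-in object whose finalizer fails the
# same way Pool.__del__ does in the traceback above. `del obj` triggers it.
CHILD_SOURCE = """
class Leaky:
    def __del__(self):
        raise OSError(9, "Bad file descriptor")

obj = Leaky()
del obj
print("DONE")
"""


def run_child(source: str = CHILD_SOURCE) -> subprocess.CompletedProcess:
    """Run `source` in a fresh interpreter and capture stdout and stderr."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    result = run_child()
    print("exit code:", result.returncode)
    print("ignored exception reported:", "Exception ignored" in result.stderr)
```

In this sketch the exception raised inside `__del__` is written to stderr and then dropped, so the script still reaches its last line and exits normally.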

Describe the expected behavior

Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)

Contributing

Do you want to contribute a PR? (yes/no): No

Standalone code to reproduce the issue

import tensorflow


def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
            tensorflow.keras.layers.Input(shape=(88, 88, 3))
        )


f()

Removing the strategy.scope() causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of def f() and f(), and invoking at the top level).

SysuJayce · 2021-06-29T06:33:23Z

can confirm this in tf2.5.0 from pypi

tilakrayal · 2021-06-29T09:00:05Z

@crm416 ,

Can you please try to execute the code in tf v2.5 and let us know if you are facing the same issue? Thanks!

charliermarsh · 2021-06-29T13:22:41Z

@tilakrayal - Yes, this only occurs for me in tf v2.5 (and not in tf v2.3 or tf v2.4).

bryanlimy · 2021-08-27T11:03:59Z

Same issue in tf v2.6. OSError on program exit if strategy.scope() is called within a function.

The following code causes OSError on exit.

import tensorflow as tf

def main():
  strategy = tf.distribute.MirroredStrategy()
  print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
      loss=tf.keras.losses.MSE,
      optimizer=tf.keras.optimizers.Adam(),
      metrics=['accuracy']
    )

  print('\nDONE\n')

if __name__ == '__main__':
  main()

with the following output:

2021-08-27 12:00:25.516889: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-27 12:00:32.832857: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.832944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9659 MB memory:  -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5
2021-08-27 12:00:32.834864: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.834898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9659 MB memory:  -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5

Number of devices: 2

DONE

Exception ignored in: <function Pool.__del__ at 0x7fbecd304040>
Traceback (most recent call last):
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Whereas the one below is fine

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
  model.compile(
    loss=tf.keras.losses.MSE,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
  )

print('\nDONE\n')

Also tested the same code snippet with tf v2.4 and it ran fine in both cases.

jayfurmanek · 2021-08-30T17:54:20Z

I see a similar error when running the recommendation model from the models repo on TF2.5.0 and later on python3.8
https://github.com/tensorflow/models/tree/v2.5.1/official/recommendation

python ncf_keras_main.py --data_dir=./data --dataset=ml-1m

I0830 13:25:38.313174 140736018925024 ncf_keras_main.py:331] Keras evaluation is done.
I0830 13:25:38.313945 140736018925024 ncf_keras_main.py:555] Result is {'loss': 0.3801446557044983, 'eval_loss': 0.0, 'eval_hit_rate': 0.09089403396520465, 'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1630344333.780229>', 'BatchTimestamp<batch_index: 100, timestamp: 1630344337.5320396>'], 'train_finish_time': 1630344337.9485285, 'avg_exp_per_second': 2638725.9874249455}
Exception ignored in: <function Pool.__del__ at 0x7fffa3c7cf70>
Traceback (most recent call last):
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

jayfurmanek · 2021-08-30T20:52:58Z

The other interesting thing is this only happens (for me at least) on py38 and py39. It runs just fine on py37, so maybe this is a python bug. Perhaps this one?
https://bugs.python.org/issue39995

npanpaliya · 2021-08-31T12:42:58Z

I tried changing MirroredStrategy to OneDeviceStrategy and the exception went away, so I'm not sure whether the issue is caused by a combination of Python and TF problems.

tekumara · 2021-12-19T00:19:26Z

This happens in TF 2.7 too, with Python 3.9.

I think it's because MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown.

You can explicitly close the pool on exit using:

import atexit

....

strategy = tf.distribute.MirroredStrategy()

atexit.register(strategy._extended._collective_ops._pool.close) # type: ignore

Which should prevent the error for now (until there is a fix).
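
A slightly more defensive variant of the workaround above, as a sketch only. It assumes nothing beyond the private attribute names already used in that snippet (`_extended`, `_collective_ops`, `_pool`); these are not a public API, and a commit referencing this issue reports `AttributeError: 'CollectiveAllReduce' object has no attribute '_pool'` on a later TensorFlow version. The helper name `register_pool_cleanup` is hypothetical.

```python
import atexit


def register_pool_cleanup(strategy) -> bool:
    """Register `pool.close` at exit if the private pool attribute exists.

    Returns True when a cleanup hook was registered, False otherwise.
    """
    # Private attributes copied from the workaround above; they may be
    # missing in other TensorFlow versions, so look them up defensively.
    extended = getattr(strategy, "_extended", None)
    collective_ops = getattr(extended, "_collective_ops", None)
    pool = getattr(collective_ops, "_pool", None)
    if pool is None:
        return False
    atexit.register(pool.close)
    return True
```

Usage would be `register_pool_cleanup(strategy)` right after creating the strategy. When the attribute chain is missing the call returns False and registers nothing, instead of raising.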

Tingbopku · 2022-01-14T05:22:00Z

This works for me, thank you!

negvet · 2022-02-04T14:25:36Z

For me in TF 2.5.0 the problem is hardware-dependent.
It is present with V100, but not with 2080 Ti.

- fixes lack of multiprocess thread pool teardown in TF Mirrored strategy as stated in tensorflow/tensorflow#50487 Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

suchunxie · 2022-07-06T14:46:36Z

@npanpaliya I'm training BERT using run_pretraining.py <https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py> and got the "Bad file descriptor" error. I then followed your post and changed the python3.8/multiprocessing/pool.py file at the place where the error shows up (see the picture below). [[image: error.png]](https://github.com/suchunxie/Random_forest/blob/main/error.png) (My environment is Ubuntu + Docker + the nvidia-tensorflow container.)

npanpaliya · 2022-07-06T15:41:35Z

@suchunxie - You can specify strategy here https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py#L207. "one_device" is supported https://github.com/suchunxie/models/blob/65e571fdc903873362e59abe0aeec5c8018da750/official/common/distribute_utils.py#L158.

suchunxie · 2022-07-07T02:40:45Z

Hi @npanpaliya, it works! I tried this before and it did not work; after you pointed it out I checked again and found that a backslash was missing before the --distribution_strategy argument. Stupid me > <.
Thanks greatly for your help !

npanpaliya · 2022-07-07T03:55:20Z

Hi @suchunxie, This is great! Glad to hear this! :)

QuantHao · 2022-07-19T05:17:08Z

It seems that a fix has been submitted (#56279 (comment)) and users need to wait for the tf 2.10 release.

gadagashwini · 2022-07-29T10:49:26Z

Hi @charliermarsh, it looks like the issue is resolved in the stable TensorFlow 2.9 release.

>>> import tensorflow
>>> def f():
...    strategy = tensorflow.distribute.MirroredStrategy()
...    with strategy.scope():
...       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
...             tensorflow.keras.layers.Input(shape=(88, 88, 3))
...         )
... 
>>> f()
2022-07-29 10:47:18.249928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.250980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.377905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.378958: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.379854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.380701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.384793: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 10:47:18.841990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.842974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.843762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.844505: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.845247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.846005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.846699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.847686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.848526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.849336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13725 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2022-07-29 10:47:20.854556: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.855362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13791 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

ZJaume · 2022-07-29T12:56:40Z

Not for me. Using TensorFlow 2.9.1 when exiting the interpreter, it shows the exception:

In [1]: import tensorflow
   ...: def f():
   ...:    strategy = tensorflow.distribute.MirroredStrategy()
   ...:    with strategy.scope():
   ...:       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
   ...:             tensorflow.keras.layers.Input(shape=(88, 88, 3))
   ...:         )
   ...: f()
2022-07-29 12:54:45.169943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 12:54:47.305006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 429 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5
2022-07-29 12:54:47.305948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9651 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5
2022-07-29 12:54:47.306459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 427 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5
2022-07-29 12:54:47.306939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 429 MB memory:  -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b4:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

In [2]:
Do you really want to exit ([y]/n)?
Exception ignored in: <function Pool.__del__ at 0x7ff160d75c10>
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

google-ml-butler · 2022-08-05T13:46:13Z

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

gadagashwini · 2022-08-09T11:30:17Z

Hi @ZJaume, could you share your system configuration? I am not able to replicate the issue. Thank you!

ZJaume · 2022-08-09T13:15:09Z

Hi, sorry for the inconvenience but now I've tried with a fresh new virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that is throwing the exception has had many different tensorflow versions from 2.3 to 2.9. Maybe some outdated dependency is causing the error.

In case you want to reproduce it my versions are:
Tensorflow version: 2.9.1
Python version: 3.8.13
OS: Ubuntu 18.04

And the output of pip freeze:

absl-py==1.1.0
aiohttp==3.8.1
aiosignal==1.2.0
antlr4-python3-runtime==4.8
astunparse==1.6.3
async-timeout==4.0.1
atomicwrites==1.4.0
attrs==21.2.0
backcall==0.2.0
bitarray==2.3.7
blessed==1.19.0
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
clang==5.0
click==8.0.3
colorama==0.4.4
Cython==0.29.24
dataclasses==0.6
datasets==1.16.1
decorator==5.1.0
dill==0.3.4
enlighten==1.10.1
fairseq==0.10.2
fastspell==0.1.5
fasttext==0.9.2
filelock==3.3.2
flatbuffers==1.12
frozenlist==1.2.0
fsspec==2021.11.1
ftfy==6.1.1
fuzzywuzzy==0.18.0
gast==0.4.0
gensim==4.1.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.41.1
h5py==3.1.0
hanzidentifier==1.0.2
huggingface-hub==0.1.0
hunspell==0.5.5
hydra-core==1.1.1
idna==3.3
importlib-resources==5.4.0
ipython==7.29.0
jedi==0.18.0
joblib==0.14.1
keras==2.9.0
Keras-Preprocessing==1.1.2
latexcodec==2.0.1
libclang==13.0.0
Markdown==3.3.4
matplotlib-inline==0.1.3
monocleaner==1.0
more-itertools==8.10.0
mtdata==0.3.1
multidict==5.2.0
multiprocess==0.70.12.2
nltk==3.6.5
numpy==1.23.0
oauthlib==3.1.1
omegaconf==2.1.1
opt-einsum==3.3.0
packaging==21.2
pandas==1.3.5
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.4.0
pluggy==0.13.1
portalocker==2.3.0
prefixed==0.3.2
prompt-toolkit==3.0.22
protobuf==3.19.1
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.8.1
pybtex==0.24.0
pycld2==0.31
pycparser==2.21
Pygments==2.10.0
pyparsing==2.4.7
pypinyin==0.46.0
pytest==5.1.2
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2021.3
PyYAML==5.4.1
regex==2022.3.2
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
sacrebleu==2.1.0
sacremoses==0.0.43
scikit-learn==0.22.1
scipy==1.4.1
sentence-transformers==2.1.0
sentencepiece==0.1.94
six==1.15.0
smart-open==5.2.1
tabulate==0.8.9
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.0.0
tokenizers==0.12.1
toolwrapper==0.4.1
torch==1.10.1
torch-train==0.0.3
torchsummary==1.5.1
torchvision==0.11.2
tqdm==4.62.3
traitlets==5.1.1
transformers==4.20.1
typing-extensions==3.7.4.3
Unidecode==1.2.0
urllib3==1.26.7
wcwidth==0.2.5
Werkzeug==2.0.2
wrapt==1.12.1
xxhash==2.0.2
yarl==1.7.2
zhon==1.1.5
zipp==3.7.0
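
Since the report above points at an environment that accumulated several TensorFlow versions, a quick way to compare two environments is to print the versions of the relevant packages. This is a sketch using only the standard library (`importlib.metadata`, Python 3.8+). The default package list is an assumption picked from the `pip freeze` output above, and the helper names are hypothetical.

```python
import sys
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(
    packages=("tensorflow", "keras", "tensorflow-estimator", "tf-estimator-nightly"),
):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version}")
```

Running it in both the old and the fresh virtual environment makes mismatches, such as a leftover nightly package next to a stable release, easy to spot.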

google-ml-butler · 2022-08-16T13:55:19Z

Closing as stale. Please reopen if you'd like to work on this further.

…ersion - fixes the issue with the latest TensorFlow version and YOLO example that results in `AttributeError: 'CollectiveAllReduce' object has no attribute '_pool'`. The issue comes from the workaround for tensorflow/tensorflow#50487 Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown (tensorflow/tensorflow#50487 (comment))
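
Independent of TensorFlow, the teardown that this explanation says is missing can be sketched with the standard library alone. The example below is an illustration of closing and joining a `multiprocessing.pool.ThreadPool` before it goes out of scope. It is not the fix that was applied in TensorFlow, and the function names are hypothetical.

```python
from multiprocessing.pool import ThreadPool


def _square(value):
    return value * value


def square_all(values, workers: int = 2):
    """Square each value on a thread pool that is shut down before returning."""
    pool = ThreadPool(processes=workers)
    try:
        return pool.map(_square, values)
    finally:
        pool.close()  # stop accepting new work
        pool.join()   # wait for the worker threads to exit


if __name__ == "__main__":
    print(square_all([1, 2, 3]))
```

Because the pool is closed and joined inside the function that created it, nothing is left for `Pool.__del__` to clean up at interpreter shutdown.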

We are currently at 2.9.1 which runs into some cosmetic errors (#25142, tensorflow/tensorflow#50487 (comment)). This PR upgrades to the latest Tensorflow release. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish <avnishnarayan@gmail.com>

Ivo-B mentioned this issue Nov 4, 2021: "OSError: [Errno 9] Bad file descriptor", Ivo-B/CC-DL-template-example#5 (open)

procyontao mentioned this issue Feb 3, 2022: "An error message after the last iteration", IsoNet-cryoET/IsoNet#25 (open)

JanuszL mentioned this issue Mar 16, 2022: "Fix YOLO v4 example non-fatal teardown error", NVIDIA/DALI#3739 (merged)

google-ml-butler bot closed this as completed Aug 16, 2022

JanuszL mentioned this issue Dec 16, 2022: "Update YOLO example for the latest to support the latest TensorFlow version", NVIDIA/DALI#4522 (merged)

krfricke mentioned this issue Feb 14, 2023: "[ci/docker/ml] Upgrade tensorflow to 2.11.0", ray-project/ray#32511 (merged)

lfoppiano mentioned this issue Sep 1, 2023: "Training with multiple GPUs", kermitt2/delft#164 (merged)
