Training stops because of `BufferError: Existing exports of data: object cannot be re-sized`, or something wrong with Tornado #60309
Comments
@CaffineAddic, the error you mention was discussed in Tornado issue 2008. The current theory is that it only happens when threads are used incorrectly, but this is not certain.
Thanks for the reply. The code actually runs fine for 150 epochs with 100 steps per epoch, but it stops with this error at any point from 70 to 200 epochs.
https://github.com/CaffineAddic/HybridMorph-proof-of-concept-.git Can you check whether this one works? I used this code to train the models before, but now, after the update, it is failing.
@CaffineAddic,
I am running it on my local machine.
Could you please provide the model_loc = 'Models/' and csv_loc = 'CSV/' datasets and the models you are trying to run, in a reproducible format? Thank you!
There is no dataset needed to run this.
Having the same issue right now while training models, with Tornado version 6.3.2 and TensorFlow version 2.12.0.
Anyone got any fix?
@CaffineAddic,
Thanks a lot for the reply. Good luck.
(also seeing this issue) |
@CaffineAddic @akellehe, could you please provide more information to help us debug the root cause of this issue?
Yes, across multiple system configurations. I have tried running it on multiple fresh installs of TF 2.12.
I am getting the same error and stack trace with a PyTorch model using the MPS backend from a Jupyter notebook. The model continues training, but output stops streaming to Jupyter.
Did you find any fix for it?
Same here. The error just randomly shows up, sometimes after 100 epochs, sometimes much earlier. 2023-07-26 17:43:54 [E 00:43:54.737 NotebookApp] Uncaught exception in zmqstream callback
At random times I was getting this error while training Keras models, and the training stopped. Turning off the verbose output of the fit() method worked for me. You can also use verbose=2 to show only a summary line for each epoch, and it will work fine.
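The verbose setting plausibly matters because of sheer output volume: in a notebook, every progress update is pushed through the kernel's ZMQStream, and Keras's verbose=1 progress bar updates roughly once per batch, while verbose=2 prints one line per epoch. A dependency-free sketch of that difference (the function and counts are illustrative assumptions, not Keras internals):

```python
def progress_writes(epochs: int, steps_per_epoch: int, verbose: int) -> int:
    """Rough count of stdout writes a Keras-style fit() would produce.

    verbose=0: silent, nothing is written.
    verbose=1: progress bar, roughly one write per batch.
    verbose=2: one summary line per epoch.
    """
    if verbose == 0:
        return 0
    if verbose == 1:
        return epochs * steps_per_epoch
    return epochs

# 150 epochs x 100 steps per epoch, as in the report above:
print(progress_writes(150, 100, 1))  # 15000 writes with the progress bar
print(progress_writes(150, 100, 2))  # 150 writes with per-epoch lines
```

Two orders of magnitude fewer writes gives the notebook's output stream far less chance to back up.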
I will test it, thank you @tgoMota
Uncaught exception in ZMQStream callback — this error is still persistent @tgoMota
I've encountered the same or a similar issue with TF 2.13. docker run --gpus all --rm -u In my case the notebook output in Firefox froze, but the training job still seems to be running. E 12:40:01.689 NotebookApp] Uncaught exception in ZMQStream callback
As @tgoMota mentioned, try keeping verbose=2. This worked for me.
@CaffineAddic, is this still an issue? If it was resolved by changing the verbose setting, could you please close the issue?
It essentially allowed me to start training, but I cannot speak for the other users. Also, shouldn't training work with the default verbose value? The Tornado errors are still there. If you want, I can close this issue. Thank you for your time and support.
Exception in callback <bound method WebSocketMixin.send_ping of ZMQChannelsHandler(bccd1408-a4a4-4810-805f-d10d0d4585df)> Same error with verbose=2. Any fixes?
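If even verbose=2 still trips the ZMQStream, one alternative (not from this thread, just a sketch) is to silence stdout entirely with verbose=0 and record metrics to a file instead; Keras ships `tf.keras.callbacks.CSVLogger` for exactly this. A dependency-free sketch of the same idea, with a hypothetical `log_epoch` helper:

```python
import csv
import os

def log_epoch(path: str, epoch: int, metrics: dict) -> None:
    """Append one epoch's metrics to a CSV file instead of stdout.

    Mimics the idea behind tf.keras.callbacks.CSVLogger: with verbose=0,
    nothing is streamed through the notebook's ZMQStream, so a frozen
    output widget cannot stall anything, and metrics are still recorded.
    """
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", *metrics])
        if new_file:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **metrics})

# Hypothetical usage inside a training loop:
log_epoch("training_log.csv", 1, {"loss": 0.42, "accuracy": 0.81})
log_epoch("training_log.csv", 2, {"loss": 0.31, "accuracy": 0.86})
```

With real Keras, the equivalent would be something like `model.fit(..., verbose=0, callbacks=[tf.keras.callbacks.CSVLogger("training_log.csv")])`.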
Issue Type
Bug
Have you reproduced the bug with TF nightly?
No
Source
source
Tensorflow Version
2.12.0
Custom Code
Yes
OS Platform and Distribution
NAME="CentOS Linux" VERSION="7 (Core)"
Mobile device
NAME="CentOS Linux" VERSION="7 (Core)"
Python version
3.9.16
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.8.0
GPU model and memory
No response
Current Behaviour?
The model training would just stop abruptly.
Standalone code to reproduce the issue
https://colab.research.google.com/drive/1WiqyF7dCdnNBIANEY80Pxw_mVz4fyV-S?usp=sharing
Relevant log output