Training stopping because of BufferError: Existing exports of data: object cannot be re-sized or something wrong with tornado #60309

Open
CaffineAddic opened this issue Apr 13, 2023 · 26 comments
Labels
comp:apis Highlevel API related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.12 For issues related to Tensorflow 2.12 type:bug Bug

Comments

@CaffineAddic

CaffineAddic commented Apr 13, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

No

Source

source

Tensorflow Version

2.12.0

Custom Code

Yes

OS Platform and Distribution

NAME="CentOS Linux" VERSION="7 (Core)"

Mobile device

NAME="CentOS Linux" VERSION="7 (Core)"

Python version

3.9.16

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

11.8.0

GPU model and memory

No response

Current Behaviour?

The model training would just stop abruptly

https://colab.research.google.com/drive/1WiqyF7dCdnNBIANEY80Pxw_mVz4fyV-S?usp=sharing

Standalone code to reproduce the issue

VoxelMorph library training

Relevant log output

(tf) vr-lab@pop-os:~$ jupyter notebook

  _   _          _      _
 | | | |_ __  __| |__ _| |_ ___
 | |_| | '_ \/ _` / _` |  _/ -_)
  \___/| .__/\__,_\__,_|\__\___|
       |_|
                       
Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.

https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html

Please note that updating to Notebook 7 might break some of your extensions.

[I 00:02:49.290 NotebookApp] Serving notebooks from local directory: /home/vr-lab
[I 00:02:49.290 NotebookApp] Jupyter Notebook 6.5.4 is running at:
[I 00:02:49.290 NotebookApp] http://localhost:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
[I 00:02:49.290 NotebookApp]  or http://127.0.0.1:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
[I 00:02:49.290 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 00:02:49.334 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///home/vr-lab/.local/share/jupyter/runtime/nbserver-405435-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
     or http://127.0.0.1:8888/?token=697572ae046e4388d22c7be946cefcb261064994d2f99466
[I 00:03:15.170 NotebookApp] Kernel started: 4915aa8a-d4aa-4d50-885f-810d53eae7db, name: python3
[I 00:03:20.670 NotebookApp] Kernel restarted: 4915aa8a-d4aa-4d50-885f-810d53eae7db
[W 00:03:20.684 NotebookApp] Replacing stale connection: 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[W 00:03:21.180 NotebookApp] zmq message arrived on closed channel
[I 00:03:21.181 NotebookApp] Starting buffering for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[I 00:03:21.183 NotebookApp] Restoring connection for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[I 00:03:21.689 NotebookApp] Replaying 1 buffered messages
[E 00:03:21.761 NotebookApp] Uncaught exception, closing connection.
    Traceback (most recent call last):
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 702, in _handle_events
        self._handle_write()
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
        self._write_buffer.advance(num_bytes)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
        assert 0 < size <= self._size
    AssertionError
[W 00:03:21.764 NotebookApp] Write error on <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>: [Errno 9] Bad file descriptor
[W 00:03:21.766 NotebookApp] zmq message arrived on closed channel
[W 00:03:21.767 NotebookApp] zmq message arrived on closed channel
Exception in callback None()
handle: <Handle cancelled>
Traceback (most recent call last):
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 206, in _handle_events
    handler_func(fileobj, events)
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 702, in _handle_events
    self._handle_write()
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
    self._write_buffer.advance(num_bytes)
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
    assert 0 < size <= self._size
AssertionError
[I 00:03:21.768 NotebookApp] Starting buffering for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
2023-04-11 00:03:22.084618: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-11 00:03:22.225493: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[I 00:03:22.803 NotebookApp] Restoring connection for 4915aa8a-d4aa-4d50-885f-810d53eae7db:e6146c4b818f471185049a02ac632f6d
[I 00:03:22.803 NotebookApp] Replaying 1 buffered messages
2023-04-11 00:03:22.815590: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/vr-lab/anaconda3/envs/tf/lib/:/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib
2023-04-11 00:03:22.815709: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/vr-lab/anaconda3/envs/tf/lib/:/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib
2023-04-11 00:03:22.815716: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-04-11 00:03:25.015062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 12776 MB memory:  -> device: 0, name: NVIDIA RTX A4000, pci bus id: 0000:af:00.0, compute capability: 8.6
2023-04-11 00:03:40.078576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8600
[I 00:05:15.159 NotebookApp] Saving file at /Music/HybridMorph Please don't delete/HybridMorph_proof of concept.ipynb
Task exception was never retrieved
future: <Task finished name='Task-76' coro=<WebSocketProtocol13.write_message.<locals>.wrapper() done, defined at /home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py:1090> exception=WebSocketClosedError()>
Traceback (most recent call last):
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1092, in wrapper
    await fut
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1094, in wrapper
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError
[E 01:03:52.904 NotebookApp] Exception in callback <bound method WebSocketMixin.send_ping of ZMQChannelsHandler(4915aa8a-d4aa-4d50-885f-810d53eae7db)>
    Traceback (most recent call last):
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
        val = self.callback()
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 188, in send_ping
        self.ping(b'')
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 445, in ping
        self.ws_connection.write_ping(data)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1101, in write_ping
        self._write_frame(True, 0x9, data)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
        return self.stream.write(frame)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
        self._write_buffer.append(data)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
        b += data  # type: ignore
    BufferError: Existing exports of data: object cannot be re-sized
[E 01:13:22.812 NotebookApp] Uncaught exception in ZMQStream callback
    Traceback (most recent call last):
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
        f = callback(*args, **kwargs)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
        return callback(self, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
        super()._on_zmq_reply(stream, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
        self.write_message(msg, binary=isinstance(msg, bytes))
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
        return self.ws_connection.write_message(message, binary=binary)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
        fut = self._write_frame(True, opcode, message, flags=flags)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
        return self.stream.write(frame)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
        self._write_buffer.append(data)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
        b += data  # type: ignore
    BufferError: Existing exports of data: object cannot be re-sized
[E 01:13:22.815 NotebookApp] Uncaught exception in zmqstream callback
    Traceback (most recent call last):
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
        self._handle_recv()
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
        self._run_callback(callback, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
        f = callback(*args, **kwargs)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
        return callback(self, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
        super()._on_zmq_reply(stream, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
        self.write_message(msg, binary=isinstance(msg, bytes))
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
        return self.ws_connection.write_message(message, binary=binary)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
        fut = self._write_frame(True, opcode, message, flags=flags)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
        return self.stream.write(frame)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
        self._write_buffer.append(data)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
        b += data  # type: ignore
    BufferError: Existing exports of data: object cannot be re-sized
[E 01:13:22.815 NotebookApp] Exception in callback functools.partial(<function ZMQStream._update_handler.<locals>.<lambda> at 0x7f1de4ff4b80>)
    Traceback (most recent call last):
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/ioloop.py", line 740, in _run_callback
        ret = callback()
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 718, in <lambda>
        self.io_loop.add_callback(lambda: self._handle_events(self.socket, 0))
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
        self._handle_recv()
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
        self._run_callback(callback, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
        f = callback(*args, **kwargs)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
        return callback(self, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
        super()._on_zmq_reply(stream, msg)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
        self.write_message(msg, binary=isinstance(msg, bytes))
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
        return self.ws_connection.write_message(message, binary=binary)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
        fut = self._write_frame(True, opcode, message, flags=flags)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
        return self.stream.write(frame)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
        self._write_buffer.append(data)
      File "/home/vr-lab/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
        b += data  # type: ignore
    BufferError: Existing exports of data: object cannot be re-sized
@google-ml-butler google-ml-butler bot added the type:bug Bug label Apr 13, 2023
@tilakrayal tilakrayal added the TF 2.12 For issues related to Tensorflow 2.12 label Apr 14, 2023
@tilakrayal
Contributor

tilakrayal commented Apr 14, 2023

@CaffineAddic,
I ran the code you mentioned with TensorFlow v2.12 and it executed without any issues. Please find the gist of it here. It also looks like the error you reported was related to TensorFlow.

The error which was mentioned was discussed in Tornado issue 2008. The current theory is that it only happens when threads are being used incorrectly, but this is not certain.
Reference. Thank you!
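
For context, the message in the log is the one CPython raises when a bytearray is resized while a memoryview still exports its buffer. The sketch below reproduces only that exception with plain built-ins. It does not use Tornado, and the link to Tornado's write buffer is an assumption based on the `b += data` line in the traceback, not a confirmed root cause. The function names are made up for illustration.

```python
def resize_with_live_export() -> str:
    """Try to grow a bytearray while a memoryview still points at it."""
    buf = bytearray(b"pending frame")
    view = memoryview(buf)  # exports the buffer; resizing is now forbidden
    try:
        buf += b" more data"  # in-place concat needs to resize buf
    except BufferError as exc:
        return str(exc)
    finally:
        view.release()
    return "no error"


def resize_after_release() -> bytes:
    """Same steps, but the export is released before resizing."""
    buf = bytearray(b"pending frame")
    view = memoryview(buf)
    view.release()  # no live exports remain
    buf += b" more data"
    return bytes(buf)


print(resize_with_live_export())
print(resize_after_release())
```

If two threads touch the same buffer, one can hold an export while the other appends, which would fit the thread-misuse theory quoted above.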

@tilakrayal tilakrayal added comp:apis Highlevel API related issues stat:awaiting response Status - Awaiting response from author labels Apr 14, 2023
@CaffineAddic
Author

Thanks for the reply. The code actually works well for 150 epochs with 100 steps per epoch, but it stops with this error at an arbitrary point, anywhere from 70 to 200 epochs.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Apr 14, 2023
@CaffineAddic
Author

https://github.com/CaffineAddic/HybridMorph-proof-of-concept-.git

Can you check whether this one works? I used this code to train the models before, but since the update it has been failing.

@tilakrayal
Contributor

@CaffineAddic,
While looking into the issue, I was unable to view any code at the above link. Could you please provide a Colab gist so that we can analyse the issue effectively? Thank you!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label Apr 28, 2023
@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Apr 28, 2023
@CaffineAddic
Author

I am running it on my local machine

@tilakrayal
Contributor

Could you please provide the model_loc = 'Models/' and csv_loc = 'CSV/' datasets, and the models you are trying to run, in a reproducible format? Thank you!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label May 13, 2023
@CaffineAddic
Author

No dataset is needed to run this.
One of the datasets is provided by the library; the other is generated during execution.
Those two locations are only used for storage: one holds the model weights as an .h5 file, and the other holds the CSV file that records the error per step.
Just point both at any temporary folder.
Thank you

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 13, 2023
@Floppa2003

Having the same issue right now while training models, tornado version 6.3.2 and tensorflow version 2.12.0.

@CaffineAddic
Author

Anyone got any fix??

@tilakrayal
Contributor

@CaffineAddic,
Apologies for the delay. We are working on the issue and will update the status here. Thank you!

@CaffineAddic
Author

Thanks a lot for the reply, Good luck.

@akellehe

akellehe commented Jun 8, 2023

(also seeing this issue)

@sachinprasadhs
Contributor

@CaffineAddic @akellehe, could you please provide more information so that we can debug the root cause of the issue?
Also, are you seeing similar behavior in other environments as well?

@sachinprasadhs sachinprasadhs added the stat:awaiting response Status - Awaiting response from author label Jun 26, 2023
@CaffineAddic
Author

Yes, across multiple system configurations. I have tried running it on multiple fresh installs of TF 2.12.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 28, 2023
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 28, 2023
@andsav

andsav commented Jul 23, 2023

I am getting the same error and stack trace with a PyTorch model using the MPS backend from a Jupyter notebook. The model continues training, but output stops streaming to Jupyter.
I suspect the problem is actually with Jupyter and the websocket that streams data from the Python backend to the output cell.
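
If training keeps running while the notebook output freezes, one way to keep progress visible is to write it to a file instead of the output cell. The following is a minimal sketch using only the standard library; `make_file_logger`, the `train.log` file name, and the placeholder loss values are assumptions for illustration and are not part of any framework mentioned in this thread.

```python
import logging
import os
import tempfile


def make_file_logger(path: str, name: str = "training") -> logging.Logger:
    """Return a logger that writes only to `path`, not to the notebook output."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from the root logger's handlers
    if not logger.handlers:
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "train.log")
logger = make_file_logger(log_path)

for epoch in range(3):
    loss = 1.0 / (epoch + 1)  # placeholder for a real training step
    logger.info("epoch=%d loss=%.4f", epoch, loss)

with open(log_path) as fh:
    lines = fh.read().splitlines()
print(len(lines), "lines written to", log_path)
```

The file can then be followed from a terminal (for example with `tail -f`) regardless of the state of the browser connection.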

@CaffineAddic
Author

Did you find any fix for it ??

@zhangyilin

Same here. The error shows up at random: sometimes after 100 epochs, sometimes much earlier.

2023-07-26 17:43:54 [E 00:43:54.737 NotebookApp] Uncaught exception in zmqstream callback
2023-07-26 17:43:54 Traceback (most recent call last):
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
2023-07-26 17:43:54 self._handle_recv()
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
2023-07-26 17:43:54 self._run_callback(callback, msg)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
2023-07-26 17:43:54 f = callback(*args, **kwargs)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
2023-07-26 17:43:54 return callback(self, msg)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
2023-07-26 17:43:54 super()._on_zmq_reply(stream, msg)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
2023-07-26 17:43:54 self.write_message(msg, binary=isinstance(msg, bytes))
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 339, in write_message
2023-07-26 17:43:54 return self.ws_connection.write_message(message, binary=binary)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1086, in write_message
2023-07-26 17:43:54 fut = self._write_frame(True, opcode, message, flags=flags)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1061, in _write_frame
2023-07-26 17:43:54 return self.stream.write(frame)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 540, in write
2023-07-26 17:43:54 self._write_buffer.append(data)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 157, in append
2023-07-26 17:43:54 b += data # type: ignore
2023-07-26 17:43:54 BufferError: Existing exports of data: object cannot be re-sized
2023-07-26 17:43:54 Exception in callback BaseAsyncIOLoop._handle_events(28, 1)
2023-07-26 17:43:54 handle: <Handle BaseAsyncIOLoop._handle_events(28, 1)>
2023-07-26 17:43:54 Traceback (most recent call last):
2023-07-26 17:43:54 File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
2023-07-26 17:43:54 self._context.run(self._callback, *self._args)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/platform/asyncio.py", line 206, in _handle_events
2023-07-26 17:43:54 handler_func(fileobj, events)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
2023-07-26 17:43:54 self._handle_recv()
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
2023-07-26 17:43:54 self._run_callback(callback, msg)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
2023-07-26 17:43:54 f = callback(*args, **kwargs)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
2023-07-26 17:43:54 return callback(self, msg)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
2023-07-26 17:43:54 super()._on_zmq_reply(stream, msg)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
2023-07-26 17:43:54 self.write_message(msg, binary=isinstance(msg, bytes))
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 339, in write_message
2023-07-26 17:43:54 return self.ws_connection.write_message(message, binary=binary)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1086, in write_message
2023-07-26 17:43:54 fut = self._write_frame(True, opcode, message, flags=flags)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1061, in _write_frame
2023-07-26 17:43:54 return self.stream.write(frame)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 540, in write
2023-07-26 17:43:54 self._write_buffer.append(data)
2023-07-26 17:43:54 File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 157, in append
2023-07-26 17:43:54 b += data # type: ignore
2023-07-26 17:43:54 BufferError: Existing exports of data: object cannot be re-sized

@tgoMota

tgoMota commented Aug 7, 2023

At random times I was getting this error while training Keras models, and the training would stop.
The problem seems to be thread synchronization while streaming the large output.

Turning off the verbose output of the fit() method worked for me,
i.e.: model.fit(trainX, trainY, ... , verbose=0)

I guess you can also use verbose=2, which shows just one summary line per epoch, and it will work fine.
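
For custom training loops that print their own progress, a similar reduction in output volume can be sketched with a rate limiter. This is not a Keras API and has not been verified against this bug; `ThrottledReporter` and its parameters are hypothetical names, and the fake clock exists only to make the demo deterministic.

```python
import time
from typing import Callable, List, Optional


class ThrottledReporter:
    """Forward a message only if `min_interval` seconds passed since the last one."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sink: Callable[[str], None] = print,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sink = sink
        self._last: Optional[float] = None

    def report(self, message: str, force: bool = False) -> bool:
        """Emit `message` if allowed; return True when it was emitted."""
        now = self._clock()
        if force or self._last is None or now - self._last >= self.min_interval:
            self._sink(message)
            self._last = now
            return True
        return False


# Demo with a fake clock: four steps at t = 0.0, 0.5, 1.0 and 2.5 seconds.
fake_times = iter([0.0, 0.5, 1.0, 2.5])
emitted: List[str] = []
reporter = ThrottledReporter(2.0, clock=lambda: next(fake_times), sink=emitted.append)
for step in range(4):
    reporter.report(f"step {step}")
print(emitted)
```

In a real loop the default `time.monotonic` clock and `print` sink would be used, with `force=True` for the final summary line.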

@CaffineAddic
Author

I will test it, thank you @tgoMota

@CaffineAddic
Author

Uncaught exception in ZMQStream callback
Traceback (most recent call last):
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 546, in write
self._handle_write()
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
[E 16:53:16.352 NotebookApp] Uncaught exception in zmqstream callback
Traceback (most recent call last):
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
self._handle_recv()
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
self._run_callback(callback, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 546, in write
self._handle_write()
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
Exception in callback BaseAsyncIOLoop._handle_events(33, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(33, 1)>
Traceback (most recent call last):
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 206, in _handle_events
handler_func(fileobj, events)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
self._handle_recv()
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
self._run_callback(callback, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 546, in write
self._handle_write()
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError

The error still persists, @tgoMota

@camda03

camda03 commented Oct 29, 2023

I've encountered the same or a similar issue with TF 2.13.

docker run --gpus all --rm -u $(id -u):$(id -g) -p 8888:8888 -p 6006:6006 -v $PWD/:/tf/david_home tensorflow/tensorflow:2.13.0-gpu-jupyter

In my case, the notebook output to Firefox has frozen, but the training job still seems to be running.
FWIW, when I run the same notebook using TF 2.13 with the LambdaStack, this does not happen, i.e. output works as expected.

[E 12:40:01.689 NotebookApp] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/usr/local/lib/python3.8/dist-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/usr/local/lib/python3.8/dist-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 334, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1081, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1056, in _write_frame
return self.stream.write(frame)
File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 533, in write
self._write_buffer.append(data)
File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
[E 12:40:01.691 NotebookApp] Uncaught exception in zmqstream callback
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/usr/local/lib/python3.8/dist-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/usr/local/lib/python3.8/dist-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 334, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1081, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1056, in _write_frame
return self.stream.write(frame)
File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 533, in write
self._write_buffer.append(data)
File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
ERROR:asyncio:Exception in callback BaseAsyncIOLoop._handle_events(29, 1)
handle: <Handle BaseAsyncIOLoop._handle_events(29, 1)>
Traceback (most recent call last):
File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.8/dist-packages/tornado/platform/asyncio.py", line 192, in _handle_events
handler_func(fileobj, events)
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 634, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 663, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/usr/local/lib/python3.8/dist-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/usr/local/lib/python3.8/dist-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 334, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1081, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/usr/local/lib/python3.8/dist-packages/tornado/websocket.py", line 1056, in _write_frame
return self.stream.write(frame)
File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 533, in write
self._write_buffer.append(data)
File "/usr/local/lib/python3.8/dist-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
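
For readers unfamiliar with the message itself: the final frame of each traceback above is an in-place append (`b += data`), and Python raises this `BufferError` when a `bytearray` is resized while a `memoryview` of it is still alive. The snippet below is only a minimal standard-library sketch of that condition. The variable names are made up for illustration, and it does not reproduce the notebook server's state or explain why a view was still held there.

```python
# Minimal sketch: resizing a bytearray while a memoryview "export" exists.
buf = bytearray(b"pending frame")
view = memoryview(buf)  # holds an export of buf's internal storage

try:
    buf += b" more data"  # in-place append needs to resize buf
    error_message = None
except BufferError as exc:
    error_message = str(exc)

print(error_message)

# Once the view is released, the same append succeeds.
view.release()
buf += b" more data"
print(bytes(buf))
```

The point of the sketch is that the error is about a buffer being resized while something else still references its memory, not about the training code itself.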

@CaffineAddic
Author

As @tgoMota mentioned, try setting

verbose=2

This worked for me.
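
A minimal sketch of where that setting goes, assuming the training uses the Keras `fit` API. The toy model and random data here are placeholders for illustration, not the Voxelmorph training from this issue. With `verbose=2`, Keras prints one summary line per epoch instead of a continuously updating progress bar.

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model, purely for illustration.
x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# verbose=2: one line of output per epoch, no per-batch progress bar.
history = model.fit(x, y, epochs=2, batch_size=16, verbose=2)
```

Whether this avoids the tornado errors may depend on the setup; a later comment in this thread reports the same error with `verbose = 2`.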

@sachinprasadhs
Contributor

@CaffineAddic, is this still an issue? If the issue is resolved by changing verbose=0 to verbose=2, could you please close the issue?
Also, please use the latest TensorFlow version to get the latest updates. Thank you!

@sachinprasadhs added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Nov 6, 2023
@CaffineAddic
Author

Yes, it essentially allowed me to start training, but I cannot comment on the rest of the users. Also, shouldn't training work with the default verbose value? The tornado errors are still there. If you want, I can close this issue. Thank you for your time and support.

@google-ml-butler bot removed the stat:awaiting response label on Nov 6, 2023
@sachinprasadhs added the stat:awaiting tensorflower label on Nov 6, 2023
@CaffineAddic
Author

Exception in callback <bound method WebSocketMixin.send_ping of ZMQChannelsHandler(bccd1408-a4a4-4810-805f-d10d0d4585df)>
Traceback (most recent call last):
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 188, in send_ping
self.ping(b'')
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 445, in ping
self.ws_connection.write_ping(data)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1101, in write_ping
self._write_frame(True, 0x9, data)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 540, in write
self._write_buffer.append(data)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/iostream.py", line 157, in append
b += data # type: ignore
BufferError: Existing exports of data: object cannot be re-sized
[I 01:36:07.003 NotebookApp] Saving file at /New_BRaTS/Brain_data/HybridMorph_proof of concept.ipynb
[E 01:36:28.774 NotebookApp] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/saumya/anaconda3/envs/tf/lib/python3.9/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary

Same error with verbose = 2. Any fixes?
