Training stops because of `BufferError: Existing exports of data: object cannot be re-sized`, or something wrong with Tornado #60309
Comments
@CaffineAddic, the error you mention was discussed in Tornado issue 2008. The current theory is that it only happens when threads are used incorrectly, but this is not certain.
Thanks for the reply. The code actually runs fine for 150 epochs with 100 steps per epoch, but it stops with this error at any point from 70 to 200 epochs.
https://github.com/CaffineAddic/HybridMorph-proof-of-concept-.git Can you check whether this one works? I used this code to train the models before, but now, after the update, it is failing.
@CaffineAddic,
I am running it on my local machine.
Could you please provide the model_loc = 'Models/' and csv_loc = 'CSV/' datasets and the models you are trying to run, in a reproducible format? Thank you!
There is no dataset needed to run this.
Having the same issue right now while training models, with Tornado version 6.3.2 and TensorFlow version 2.12.0.
Anyone got any fix?
@CaffineAddic,
Thanks a lot for the reply. Good luck.
(also seeing this issue) |
@CaffineAddic @akellehe, could you please provide more information to help us debug the root cause of this issue?
Yes, across multiple system configurations. I have tried running it on multiple fresh installs of TF 2.12.
I am getting the same error and stack trace with a PyTorch model using the MPS backend from a Jupyter notebook. The model continues training, but output stops streaming to Jupyter.
Did you find any fix for it?
Same here. The error just randomly shows up, sometimes after 100 epochs, sometimes much earlier. 2023-07-26 17:43:54 [E 00:43:54.737 NotebookApp] Uncaught exception in zmqstream callback
At random times I was getting this error while training Keras models, and the training stopped. Turning off the verbose output of the fit() method worked for me. You can also use verbose=2 to show only a summary line for each epoch, and it will work fine.
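The verbose setting plausibly matters because of sheer output volume: in a notebook, every progress update is pushed through the kernel's ZMQStream, and Keras's verbose=1 progress bar updates roughly once per batch, while verbose=2 prints one line per epoch. A dependency-free sketch of that difference (the function and counts are illustrative assumptions, not Keras internals):

```python
def progress_writes(epochs: int, steps_per_epoch: int, verbose: int) -> int:
    """Rough count of stdout writes a Keras-style fit() would produce.

    verbose=0: silent, nothing is written.
    verbose=1: progress bar, roughly one write per batch.
    verbose=2: one summary line per epoch.
    """
    if verbose == 0:
        return 0
    if verbose == 1:
        return epochs * steps_per_epoch
    return epochs

# 150 epochs x 100 steps per epoch, as in the report above:
print(progress_writes(150, 100, 1))  # 15000 writes with the progress bar
print(progress_writes(150, 100, 2))  # 150 writes with per-epoch lines
```

Two orders of magnitude fewer writes gives the notebook's output stream far less chance to back up.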
I will test it, thank you @tgoMota
Uncaught exception in ZMQStream callback — this error is still persistent @tgoMota
I've encountered the same or a similar issue with TF 2.13. docker run --gpus all --rm -u In my case the notebook output in Firefox froze, but the training job still seems to be running. E 12:40:01.689 NotebookApp] Uncaught exception in ZMQStream callback
As @tgoMota mentioned, try keeping verbose=2. This worked for me.
@CaffineAddic, is this still an issue? If it was resolved by changing the verbose setting, could you please close the issue?
It essentially allowed me to start training, but I cannot speak for the other users. Also, shouldn't training work with the default verbose value? The Tornado errors are still there. If you want, I can close this issue. Thank you for your time and support.
Exception in callback <bound method WebSocketMixin.send_ping of ZMQChannelsHandler(bccd1408-a4a4-4810-805f-d10d0d4585df)> Same error with verbose=2. Any fixes?
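If even verbose=2 still trips the ZMQStream, one alternative (not from this thread, just a sketch) is to silence stdout entirely with verbose=0 and record metrics to a file instead; Keras ships `tf.keras.callbacks.CSVLogger` for exactly this. A dependency-free sketch of the same idea, with a hypothetical `log_epoch` helper:

```python
import csv
import os

def log_epoch(path: str, epoch: int, metrics: dict) -> None:
    """Append one epoch's metrics to a CSV file instead of stdout.

    Mimics the idea behind tf.keras.callbacks.CSVLogger: with verbose=0,
    nothing is streamed through the notebook's ZMQStream, so a frozen
    output widget cannot stall anything, and metrics are still recorded.
    """
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", *metrics])
        if new_file:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **metrics})

# Hypothetical usage inside a training loop:
log_epoch("training_log.csv", 1, {"loss": 0.42, "accuracy": 0.81})
log_epoch("training_log.csv", 2, {"loss": 0.31, "accuracy": 0.86})
```

With real Keras, the equivalent would be something like `model.fit(..., verbose=0, callbacks=[tf.keras.callbacks.CSVLogger("training_log.csv")])`.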
Issue Type
Bug
Have you reproduced the bug with TF nightly?
No
Source
source
Tensorflow Version
2.12.0
Custom Code
Yes
OS Platform and Distribution
NAME="CentOS Linux" VERSION="7 (Core)"
Mobile device
NAME="CentOS Linux" VERSION="7 (Core)"
Python version
3.9.16
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.8.0
GPU model and memory
No response
Current Behaviour?
The model training would just stop abruptly.
Standalone code to reproduce the issue
https://colab.research.google.com/drive/1WiqyF7dCdnNBIANEY80Pxw_mVz4fyV-S?usp=sharing
Relevant log output