[CLI] Network error (ReadTimeout), entering retry loop. #2039
Comments
I believe this might be related to #780 but I'm not certain |
Hey @prash-p Does this problem still persist? |
Hey, I actually downgraded wandb to 0.10.23 to avoid this error. It wasn't related to colab/.ipynbs/jupyter, but to a script I was running from the command line. It was also not straightforward to reproduce. |
This issue is stale because it has been open 60 days with no activity. |
I'm also experiencing this issue with |
I think they are experiencing some server problems. I'm having the same issue since this morning |
@raoulraft @cfierro94 we did experience some degraded performance for about 30 minutes starting an hour ago, but we've mostly recovered. If you're still seeing errors like this, can you please share the debug-internal.log? |
Sure! I've tried several runs this morning. This is one of them: 2021-09-15 10:36:14,549 INFO MainThread:12420 [internal.py:wandb_internal():63] W&B internal server running at pid: 12420 |
The same is happening on my side. Within the debug-internal.log I found this:
Any clue why this is happening? |
It's still happening for me; here's the debug-internal.log. |
Same problem here. |
Working fine now. It seems that it was on their end. |
The same problem started a few hours ago. My code was working fine previously, but now the same code hits the network error. |
I'm having the same issue as well with previously working code. The issue started a couple of hours ago. |
Also having this issue again. OS: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-centos-7.9.2009-Core |
We had an outage this morning related to tag usage that is now resolved. If any users are still seeing this currently, please share a snippet of how you're calling wandb. |
Thanks! I can run with |
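For reference, a minimal sketch of the kind of calling snippet being requested above; the project, run, and metric names are placeholders, not taken from any specific report:

```python
# Minimal logging loop of the shape that triggers the retry loop for
# several reporters in this thread. All names are placeholders.
import wandb

run = wandb.init(project="my-project", name="repro-run")
for step in range(1000):
    wandb.log({"loss": 1.0 / (step + 1)}, step=step)
run.finish()
```
|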
I have the same problem, but I run a hosted version of wandb. Apparently in my case, adding a little more information to the artifact metadata triggers this situation. |
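For context, the artifact pattern described above looks roughly like this; the artifact name, file, and metadata keys are placeholders rather than the reporter's actual values:

```python
import wandb

run = wandb.init(project="my-project")
# Enlarging this metadata dict is what reportedly triggers the timeout
# on the hosted instance; keys and values here are placeholders.
artifact = wandb.Artifact(
    "my-dataset",
    type="dataset",
    metadata={"rows": 100_000, "source": "internal-db"},
)
artifact.add_file("data.csv")
run.log_artifact(artifact)
run.finish()
```
|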
Hello everybody, just wanted to let you know that I've been experiencing the same issue for some weeks now.
Thanks for your help! |
Hi @CrohnEngineer, We are currently on a newer release; could you try upgrading and see whether that resolves it? If it doesn't, please let us know and we can pick this issue up from there. Thanks, |
Hi @CrohnEngineer, We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved. |
Hey @ramit-wandb, Thank you for getting in touch! |
Hey @ramit-wandb, Just wanted to let you know that upgrading to |
Hi, Can I somehow check whether the problem is server-side or client-side? Here are my debug.log screenshots. |
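One rough first-pass check is whether the W&B API endpoint is reachable from the training machine at all; this sketch assumes the public cloud endpoint, so self-hosted users would substitute their own host:

```python
# Reachability probe: distinguishes "this machine cannot reach the
# server" from "the server answers but requests still time out".
import requests

try:
    resp = requests.get("https://api.wandb.ai", timeout=10)
    print("Endpoint reachable, HTTP status:", resp.status_code)
except requests.exceptions.RequestException as exc:
    print("Could not reach endpoint from this machine:", exc)
```

If this succeeds but runs still hit ReadTimeout, the problem is more likely load- or server-side than basic connectivity. |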
I'm getting this issue as well after hours of training on a cluster. Any idea why this is happening? |
Similar problem... I just tried a set of hyperparameter sweeps, and sometime in the middle I get 'wandb: Network error (ReadTimeout), entering retry loop.' Would it be better/more reliable to keep the data local during the sweeps and then somehow sync afterwards? |
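Keeping the data local and syncing afterwards is possible: runs can be started in offline mode and uploaded later with the `wandb sync` CLI. A minimal sketch, with a placeholder project name:

```python
import wandb

# Nothing is sent to the server during training; everything is written
# to a local wandb/offline-run-* directory instead.
run = wandb.init(project="my-project", mode="offline")
wandb.log({"loss": 0.5})
run.finish()

# Later, from a shell with connectivity:
#   wandb sync wandb/offline-run-*
```

One caveat: as far as I know, a sweep driven by `wandb agent` still needs server access to fetch the next hyperparameter set, so offline mode helps with metric upload rather than sweep coordination. |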
Hi, same issue here. After a few hours of training I get: |
I am facing the same error using the latest wandb version |
I have the same issue on the latest wandb version. |
I'm facing the same problem using the latest version of wandb. My internet connection gets worse every night, and just at that time my Python program seems to stop responding with "wandb: Network error (ConnectionError), entering retry loop". I would like to know if you could offer any solution. |
When my sweep was running, I encountered the same issue on the latest wandb version, 0.13.4. The errors are as follows: wandb: Network error (ReadTimeout), entering retry loop. |
I found that when I launch a Weights & Biases (wandb) service with simulated data alone, there are no issues with the service communication. However, when I simultaneously load a model on the GPU, the wandb service immediately stops (with the same error as mentioned above). If I restart the wandb service at this point, it automatically stops again after a fixed period (about one minute). Could this be related to the load balancer?
Training /chenhui/zhangwuhan/stage2/trained_model/qwen1.5_7b_5_5e-5_2_1k_plugin 0
0%| | 0/14272 [00:06<?, ?it/s, train_loss=4.85]2024-04-27 17:18:41,774 - DEBUG - Successfully logged to WandB
0%| | 1/14272 [00:11<25:10:01, 6.35s/it, train_loss=2.9]2024-04-27 17:18:47,054 - DEBUG - Successfully logged to WandB
0%| | 2/14272 [00:16<22:40:19, 5.72s/it, train_loss=2.25]2024-04-27 17:18:52,202 - DEBUG - Successfully logged to WandB
0%| | 3/14272 [00:22<21:38:07, 5.46s/it, train_loss=2.09]2024-04-27 17:18:57,456 - DEBUG - Successfully logged to WandB
0%| | 4/14272 [00:27<21:18:49, 5.38s/it, train_loss=2.02]2024-04-27 17:19:02,601 - DEBUG - Successfully logged to WandB
0%| | 5/14272 [00:32<20:58:47, 5.29s/it, train_loss=1.88]2024-04-27 17:19:07,853 - DEBUG - Successfully logged to WandB
0%| | 6/14272 [00:38<25:34:01, 6.45s/it, train_loss=1.87]
Traceback (most recent call last):
File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 290, in <module>
main()
File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 283, in main
accelerator.log({"train_loss": loss.item()}, step=batch_idx)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 602, in _inner
return PartialState().on_main_process(function)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2267, in log
tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))
File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 86, in execute_on_main_process
return function(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 333, in log
self.run.log(values, step=step, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1838, in log
self._log(data=data, step=step, commit=commit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1602, in _log
self._partial_history_callback(data, step, commit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1474, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 602, in publish_partial_history
self._publish_partial_history(partial_history)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
self._publish(rec)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
2024-04-27 17:19:14,144 - DEBUG - Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
2024-04-27 17:19:16,144 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,144 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
I hope the team can resolve this issue as soon as possible. |
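Until the root cause is addressed, one defensive pattern, offered only as a sketch and not as an official recommendation, is to guard log calls so a dying wandb service does not take down a long training run:

```python
import logging

import wandb

def safe_log(metrics, step=None):
    # Guards against the service dying mid-run, as in the
    # BrokenPipeError traceback above. Dropping one metric point is
    # usually preferable to losing hours of training.
    try:
        wandb.log(metrics, step=step)
    except BrokenPipeError as exc:
        logging.warning("wandb service unavailable, dropping metrics: %s", exc)
```

This only catches the specific failure shown in the traceback; other wandb errors may warrant their own handling. |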
Description
My program just hangs when wandb cannot connect and log data. The error is:
Network error (ReadTimeout), entering retry loop. See wandb/debug-internal.log for full traceback.
Wandb features
wandb.log()
Environment
Here is part of the debug-internal.log: