-
Notifications
You must be signed in to change notification settings - Fork 623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CLI] Wandb stalls training loop #3223
Comments
I have the same issue, any luck? |
Hi @Guitaricet, |
Thank you, this sounds like a reasonable solution. I will try it out this week. |
Hi @Guitaricet , Thank you, |
Hi! The solution seem to be working! |
@Guitaricet Happy to hear that! Thank you for writing in. |
I had the same problem and solved it by setting the env variable |
Description
This is a very frustrating bug, which I guess is also hard to pinpoint, but I lost a couple of thousand $ on cloud compute because wandb stalled the whole program during training 馃槩. I love wandb, it is genuinely one of the best products I ever used, but I think it is not safe for me to continue using it for large-scale tasks (where it would help me quite a bit).
About an hour after starting the training script, training stops (GPUs don't run anything, a CPU process works, but does nothing). When I CTRL+C the script, the stacktrace looks like this
Wandb features
How to reproduce
Unfortunately, I did not find a simple way to reproduce it, but I get it every time I run this T5 distillation script.
Environment
The text was updated successfully, but these errors were encountered: