[slurm] wandb hangs at the end of jobs in dryrun mode #919
Comments
Thanks for reporting this! @raubitsj, could you please take a look?
Hey @bknyaz
I am having this problem as well. Has anyone ever figured this out?
@lukekenworthy can you provide an example script? If you're using multiprocessing in your scripts, you may need to explicitly call
I am having the same issue when using wandb.sweep. Where exactly should I put wandb.finish() in the script?
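A minimal sketch of one possible placement, not a confirmed fix for this issue: wrap the body of the function the sweep agent runs in `try`/`finally`, so that `finish()` is reached even if training raises. The `train` function, its `wandb_module` parameter (the imported `wandb` package, passed in so the sketch can be exercised without the library), and the placeholder loss are all assumptions made for illustration.

```python
def train(wandb_module, steps=3):
    """Run one trial and always close the run, even on errors.

    `wandb_module` is assumed to be the imported `wandb` package
    (hypothetical parameter, used here so the function is easy to test).
    """
    wandb_module.init()
    try:
        for step in range(steps):
            loss = 1.0 / (step + 1)  # placeholder for a real training step
            wandb_module.log({"loss": loss, "step": step})
    finally:
        # Reached on normal completion and when an exception propagates.
        wandb_module.finish()
```

With the real library this would be handed to the agent as something like `wandb.agent(sweep_id, function=lambda: train(wandb))`, where `sweep_id` is whatever `wandb.sweep(...)` returned. If a script starts several runs in a loop, the same pattern would apply once per run.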
I'm dealing with the same issue. Any solutions for this yet?
Same issue here. Seems no update?
@ssadhukha or @zdhNarsil can you share an example script that gets you into this state?
Same issue here
I'm running into this state when I specify a different run inside a for loop. The main parts of my script below:
Same issue here, still no update?
Same issue here, any update?
@adrialopezescoriza @cherry-nancy @davidoort could you please provide a small repro so we could debug the issue and hopefully resolve it for you?
@ssadhukha just to verify something in your repro: #919 (comment)
Hey folks, we implemented network logging and file pusher timeout for better debugging. If you are still running into this issue, then please share a small repro as my colleague asked above. Please try setting the env vars as suggested in this PR.
wandb --version && python --version && uname
Description
I'm using wandb on a GPU cluster with Slurm to run jobs.
After the script finishes, wandb prints the following:
The problem is that the Slurm scheduler doesn't end the job, so it keeps occupying the GPU node. Perhaps some wandb processes are still running for some reason?
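One way to check that guess from inside the script is to list what is still alive just before it returns. The sketch below uses only the standard library; `report_lingering_workers` is a hypothetical helper name. It can only see threads in the current interpreter and children started through `multiprocessing`, so a helper process launched some other way would not show up here and would need something like `ps` on the node.

```python
import multiprocessing
import threading


def report_lingering_workers():
    """Return (thread names, child process labels) that are still alive.

    Only non-daemon threads are listed, because those are the ones that
    keep the interpreter from exiting.
    """
    main = threading.main_thread()
    threads = [
        t.name
        for t in threading.enumerate()
        if t is not main and t.is_alive() and not t.daemon
    ]
    children = [
        f"{p.name} (pid {p.pid})" for p in multiprocessing.active_children()
    ]
    return threads, children


if __name__ == "__main__":
    # ... the training code would run here ...
    threads, children = report_lingering_workers()
    print("non-daemon threads still alive:", threads)
    print("child processes still alive:", children)
```

If both lists are empty and the job still does not end, the leftover process was probably started outside this interpreter.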
I'm not sure whether the issue is with wandb or with the cluster I'm using. The cluster is one of the biggest in Canada, so I imagine other people hit this too, and it can leave a lot of nodes idle for no reason. It would be great to solve this.
Other clusters I've used, with Ubuntu and Internet access, worked fine.
I use WANDB_MODE=dryrun because the cluster doesn't have access to external networks.
Update
My impression is that wandb tries to connect to the server after the script has finished. Because there is no connection, it raises some exception, and the process then gets stuck for some reason.
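To find out where the process is stuck, one option is a watchdog built on the standard library's `faulthandler` module. This is a debugging sketch rather than a fix: if the interpreter is still alive after `timeout_s` seconds, it dumps the traceback of every thread, and with `hard_exit=True` it then terminates the process so the node is released. The helper names and the 300-second default are assumptions.

```python
import faulthandler
import sys


def arm_exit_watchdog(timeout_s=300.0, hard_exit=True, file=None):
    """Dump all thread tracebacks if the process outlives `timeout_s`.

    With hard_exit=True, faulthandler calls _exit(1) after the dump, which
    skips normal interpreter cleanup (atexit handlers, buffer flushing).
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, exit=hard_exit, file=target)
    return timeout_s


def disarm_exit_watchdog():
    """Cancel a pending watchdog, e.g. when more work is about to start."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_exit_watchdog()` as the last statement of the training script would start the countdown just before shutdown begins. Because the hard exit skips cleanup, anything not yet written to disk at that point may be lost.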
In one of my log files I found an additional line printed at the end regarding the connection:
What I Did
see above
Thanks.