GPU memory leak while running sweeps #1247
Comments
Thanks for reporting! We're looking into this.
Hello, I'm hitting the same error while using the sweep module. The program cannot free the GPU memory by itself, so I have to clean up the GPU memory afterwards. It also seems I can't kill the program with Ctrl+C: it prints a wandb log message saying "ctrl+c pressed" and keeps running as normal. If I press Ctrl+C twice, the program is killed but the GPU memory stays leaked. Any solutions for this issue? Thanks a lot.
Hey @zoeyuchao, is this happening to you in an older version of wandb, or the latest one?
I tried the older versions and the latest one, and I still get the same issue.
I see, thanks for following up. We're looking into fixing this issue.
I am also having a similar problem. I tried it with older versions and now with 0.10.10, and I still have the same problem: the GPU runs out of memory on the 3rd or 4th run of the sweep.
Hey @MLBurnham, thanks for the update.
Hey @reymondzzzz I'm assuming this is within the context of a sweep? If you're not logging gradients, I can't think of any wandb feature that would interact with GPU memory. In the context of a sweep, it's possible that multiple processes are trying to use the same GPU if things aren't configured properly. We would need more information about the environment and the exact behavior to understand what the root cause could be.
Sorry for the slow reply. Here's the sweep script I ran. The memory leak only occurs when I run the sweep. When running the sweep I can run ~4 iterations of the model before I run out of memory. When doing a manual grid search with the out-of-the-box Simple Transformers library, I can run dozens of iterations of the model back to back.
Ahh, my guess is that something in the TransformerModel is holding onto GPU memory between iterations. If you can refactor the code to use process-based execution, that should address the issue. Assuming you can write the above code to a file named train.py, something like the following should work:
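(The snippet that originally followed appears to be missing from this thread; below is a rough sketch of the kind of process-based setup being described, with placeholder parameter and project names, not the original code.)

```python
import wandb

# Sketch only: a sweep config whose "program" points at the train.py written above,
# so every trial runs as its own process and its GPU memory is released on exit.
sweep_config = {
    "program": "train.py",
    "method": "grid",
    "parameters": {
        # placeholder hyperparameters; use whatever train.py actually reads
        "learning_rate": {"values": [1e-5, 3e-5, 5e-5]},
        "num_train_epochs": {"values": [2, 3]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")  # placeholder project name
print(sweep_id)

# Then launch trials from a shell rather than from inside this Python session:
#   wandb agent <entity>/my-project/<sweep_id>
```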
Then instead of calling
Hi @vanpelt, is there another solution besides running the script from the command line? I'm using Google Colab for GPU access, which uses notebooks, and as I understand it that can cause some issues with wandb sweeps.
Hey @nhsmith85 we'll look into the root cause for in-process sweeps, but you should be able to run a script from Colab pretty easily. You can use the
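(The rest of that comment is cut off. One way to do it from a notebook, sketched here as a guess rather than the original suggestion, is to write the training code out to a file and launch the agent as a child process so each trial's GPU memory is freed when its process exits.)

```python
# Sketch only: run sweep trials as separate processes from a Colab notebook.
import subprocess

train_code = '''
import wandb

def main():
    with wandb.init() as run:
        config = run.config
        # ... build the model and train it here ...

if __name__ == "__main__":
    main()
'''

with open("train.py", "w") as f:  # write the script the sweep will execute
    f.write(train_code)

# Launch the agent as a child process; in a notebook cell you could
# equivalently use `!wandb agent <sweep_id>`.
subprocess.run(["wandb", "agent", "YOUR_SWEEP_ID"], check=True)  # placeholder sweep id
```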
This issue is stale because it has been open 60 days with no activity. |
I'm having the same issue: huggingface/transformers#10885
@jwa018 With Hugging Face, you could try:
Otherwise, if you're doing a parameter search, you should check out the sweeps documentation, as you can structure your code to run as independent processes (not sure if that's what you're already doing).
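(The snippet after "you could try:" didn't survive in this thread. Below is a hedged guess at the kind of per-run cleanup that helps with Hugging Face's Trainer when everything runs in one process; the argument values and dataset names are placeholders.)

```python
# Sketch only: free the Trainer/model between runs when looping over
# hyperparameters inside a single Python process.
import gc
import torch
import wandb
from transformers import Trainer, TrainingArguments

def run_trial(model_init, train_dataset, eval_dataset, config):
    run = wandb.init(config=config, reinit=True)
    args = TrainingArguments(
        output_dir="out",
        report_to="wandb",                  # send Trainer logs to wandb
        num_train_epochs=config["epochs"],  # placeholder hyperparameter
    )
    trainer = Trainer(
        model=model_init(),
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    run.finish()

    # Drop every reference to GPU-resident objects before flushing the cache,
    # otherwise the allocator can't actually return the memory.
    del trainer
    gc.collect()
    torch.cuda.empty_cache()
```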
This is also happening to me when I'm running sweeps. After the agent finishes the first job, the GPU memory is not freed and the next job doesn't run because it hits an out-of-memory error. Did anyone find a fix?
@goncalomcorreia running sweeps from within a single process using
Hi! Thanks for the help, @vanpelt! I think this is what I'm doing. I'm running
In my sweep configuration I have:
This either means your previous trial didn't exit cleanly, which would be really strange, or it means there are other processes on the box consuming GPU memory. When executing agents this way, we execute the python command, wait for it to exit, then execute a new python command. When Python exits it should release all of its GPU memory, so something else must be holding onto it. You could try running
Before the process ends:
After the process finishes, nothing appears in `nvidia-smi`, but the memory is still allocated as if the process were running.
What does
Exactly as @leopd described above:
@goncalomcorreia What does it say about used memory? Can you just copy the output of
@goncalomcorreia bumping Chris' suggestion
Here it is:
Notice how in GPU
@goncalomcorreia Are you still experiencing these types of issues with our most recent version of the client library, 0.12.7?
I'm actually still having this problem, and it's very similar to what happened before: `nvidia-smi` shows nothing in the process list, but the memory is used. In nvtop (https://github.com/Syllo/nvtop), though, a ghost process shows up and it is the one using the memory, but it's not killable (killing it using the PID shown there doesn't do anything, and nvtop also fails to see the process command). So in my experience, if that happens, the python process is actually still there. If I kill all processes

I know this thread is kinda old, but I don't really know how to search this issue and this is the only one I found that was related to my case. My current wandb version is 0.12.17.

PS: I saw some comments on watching the model parameters (which I did use), so I'm trying to do it without the watch.
Albeit closed, I thought I would offer my 2 cents on this since I was wrestling with this issue for the past few days. Looks like freeing up the resources at the end of the training functions does the trick.

```python
# main training loop
def train(config=None):
    with wandb.init(config=config):
        config = wandb.config
        ...
        # cleanup
        del model
        torch.cuda.empty_cache()

# running sweep
wandb.agent(sweep_id, train)
wandb.finish()
```

That way you can still use the python approach without having to define the
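A small caveat about the cleanup above, plus a sketched variant (not from the original comment): `torch.cuda.empty_cache()` can only hand back blocks whose tensors are no longer referenced anywhere, so the `del` has to come first, and an explicit `gc.collect()` in between can help when something such as an optimizer, a callback, or a stored traceback still holds a reference to the model.

```python
# Sketch: the order matters. Drop references, collect, then empty the cache.
import gc
import torch

model = torch.nn.Linear(10, 10).cuda()  # stand-in for the real model
# ... training would happen here ...

del model                 # drop every reference to GPU-resident objects first
gc.collect()              # reclaim anything kept alive only by reference cycles
torch.cuda.empty_cache()  # then return the now-unused cached blocks to the driver
```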
This doesn't help the "memory leak" that occurs when I Ctrl+C to terminate the programs though. |
`wandb --version && python --version && uname`
Weights and Biases version: 0.9.7
Python version: 3.7.9
Operating System: Ubuntu 18.04 LTS
Description
I'm running sweeps, and I notice that every so often one of the GPUs doesn't reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where `nvidia-smi` reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory. I can only reclaim the memory by rebooting the machine. (I've read that sometimes `nvidia-smi -r` will fix this, but it's never let me reset the GPU that way, I think because X-windows is running on it.)
What I Did
This is not a great bug report, because I don't know how to repro it. I'm not even sure it's anything to do with wandb, or just some bug between CUDA & PyTorch or something. But I've seen it three or four times now, and only when running wandb sweeps. I've mostly been using hyperband early termination with my sweeps. And I sometimes will kill jobs manually from the wandb web UI. So I suspect it's maybe got something to do with the way the agent kills the python process that's using the GPU - maybe it's not cleaning up properly.