
GPU memory leak while running sweeps #1247

Closed
leopd opened this issue Sep 18, 2020 · 32 comments

Labels
ty:bug type of the issue is a bug

Comments

@leopd

leopd commented Sep 18, 2020

wandb --version && python --version && uname

Weights and Biases version: 0.9.7
Python version: 3.7.9
Operating System: Ubuntu 18.04LTS

Description

I'm running sweeps, and I notice that every so often one of the GPUs doesn't reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where nvidia-smi reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory. I can only reclaim the memory by rebooting the machine. (I've read that sometimes nvidia-smi -r will fix this, but it has never let me reset the GPU that way, I think because X Windows is running on it.)

What I Did

This is not a great bug report, because I don't know how to reproduce it. I'm not even sure it has anything to do with wandb; it might just be some bug between CUDA and PyTorch. But I've seen it three or four times now, and only when running wandb sweeps. I've mostly been using hyperband early termination with my sweeps, and I sometimes kill jobs manually from the wandb web UI. So I suspect it has something to do with the way the agent kills the Python process that's using the GPU, and maybe it's not cleaning up properly.

@issue-label-bot

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.93.

issue-label-bot bot added the ty:bug label Sep 18, 2020
@tyomhak

tyomhak commented Sep 22, 2020

Thanks for reporting! We're looking into this.

@zoeyuchao

zoeyuchao commented Oct 15, 2020

Hello, I hit the same error while using the sweep module. The program cannot free the GPU memory by itself, so I have to clean up the GPU memory afterwards. It also seems I can't kill the program with Ctrl+C: wandb prints a log message saying "ctrl+c pressed" and keeps running as normal. If I press Ctrl+C twice, the program is killed but the GPU memory is leaked. Any solutions to this issue? Thanks a lot.

@tyomhak

tyomhak commented Oct 19, 2020

hey @zoeyuchao, is this happening to you on an older version of wandb, or the latest one?

@zoeyuchao

hey @zoeyuchao, is this happening to you on an older version of wandb, or the latest one?

I tried older versions and the latest one, and I still get the same issue.

@tyomhak

tyomhak commented Oct 21, 2020

I see, thanks for following up. We're looking into fixing this issue.

@Odrec

Odrec commented Nov 17, 2020

I am also having a similar problem. I tried it with older versions and now with 0.10.10, and I still have the same problem: the GPU runs out of memory on the 3rd or 4th run of the sweep.

@MLBurnham

MLBurnham commented Dec 6, 2020

I'm also having this issue. Roughly 10% of memory (1 GB) isn't being released between runs. I'm using version 0.10.12 on Ubuntu 20.04 with CUDA 11.0 to train a classification model with the Simple Transformers library. You can clearly see the leak across successive runs in the attached screenshot.
[Screenshot: GPU memory usage climbing across successive runs]

@ariG23498
Contributor

Hey @MLBurnham, thanks for the update.
Could you share a simple working script with us so that we can reproduce this issue?
Thanks in advance 😄

@reymondzzzz

I have the same problem. It happens after training and validation, before a new epoch. I don't log gradients or parameters, only metrics. The memory doesn't leak on every GPU; see the screenshot (1 epoch is roughly 24.5 hours). The second time it happened on the 4th epoch and I ran out of memory.
[Screenshot: GPU memory usage over time]
Using version 0.10.10.

@vanpelt
Contributor

vanpelt commented Dec 30, 2020

Hey @reymondzzzz, I'm assuming this is within the context of a sweep? If you're not logging gradients, I can't think of any wandb feature that would interact with GPU memory. In the context of a sweep it's possible that multiple processes end up trying to use the same GPU if things aren't configured properly. We would need more information about the environment and the exact behavior to understand what the root cause could be.
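
One quick way to rule that out (just a sketch; the device index is a placeholder) is to pin each agent's process to its own GPU before anything initializes CUDA:

import os

# Pin this process to a single GPU so concurrent sweep agents don't share a device.
# Set this before importing torch / initializing CUDA; use "1", "2", ... for the other agents.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"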

@MLBurnham

MLBurnham commented Dec 30, 2020

Hey @MLBurnham, thanks for the update.
Could you share a simple working script with us so that we can reproduce this issue?
Thanks in advance

Sorry for the slow reply. Here's the sweep script I ran. The memory leak only occurs when I run the sweep: when running it, I can do ~4 iterations of the model before I run out of memory. When doing a manual grid search with the out-of-the-box Simple Transformers library, I can run dozens of iterations of the model back to back.

import logging

import pandas as pd
import wandb
from simpletransformers.classification import ClassificationArgs, ClassificationModel
from sklearn.metrics import accuracy_score

# config
sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min": 1, "max": 5},
        "learning_rate": {"min": 0, "max": 0.00010000},
    },
    # "early_terminate": {"type": "hyperband", "min_iter": 6},
}

sweep_id = wandb.sweep(sweep_config, project="Political Tweet Classifier")

# logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# data import
df = pd.read_csv("Tweets/oct_sample.csv")
df = df[['text', 'political']].dropna()

# train test split
train = df.sample(frac = 0.8, random_state = 907)
test = df.drop(train.index).reset_index(drop=True)
train = train.reset_index(drop = True)
train_df = train[['text', 'political']]
eval_df = test[['text', 'political']]
# convert strings to lower
train_df['text'] = train_df['text'].str.lower()
eval_df['text'] = eval_df['text'].str.lower()

# args
model_args = ClassificationArgs()
model_args.evaluate_during_training = True
model_args.evaluate_during_training_silent = False
model_args.evaluate_during_training_steps = 100
model_args.learning_rate = 4e-4
model_args.manual_seed = 907
model_args.max_seq_length = 256
model_args.no_cache = True
model_args.no_save = True
model_args.num_train_epochs = 10
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.train_batch_size = 16
model_args.eval_batch_size = 16
model_args.train_custom_parameters_only = False
model_args.wandb_project = "Political Tweet Classifier"

# training function
def train():
    # Initialize a new wandb run
    wandb.init(resume = True)

    # Create a TransformerModel
    model = ClassificationModel(
        "electra",
        "google/electra-base-discriminator",
        use_cuda=True,
        args=model_args,
        sweep_config=wandb.config,
    )

    # Train the model
    model.train_model(
        train_df,
        eval_df=eval_df,
        accuracy=lambda truth, predictions: accuracy_score(
            truth, [round(p) for p in predictions]
        ),
    )

    # Sync wandb
    wandb.join()

# train
wandb.agent('ab5cvrjn', train)

@vanpelt
Contributor

vanpelt commented Dec 30, 2020

Ahh, my guess is that something in the TransformerModel is holding onto GPU memory between iterations. If you can refactor the code to use process-based execution, that should address the issue. Assuming you write the above code to a file named train.py, something like the following should work:

# Add the following to your sweep config
# program: "train.py"

# Add the following to the end of your script
if __name__ == "__main__":
    train()

Then, instead of calling wandb.agent(...) inside Python, call it from the command line: wandb agent SWEEP_ID. This will launch each trial in a separate process, which should guarantee you don't hold onto any memory.
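
For reference, a rough sketch (untested here) of what the sweep definition from the script above might look like with the program key added, so the command-line agent knows which script to launch for each trial:

# Sketch: define the sweep once, then run `wandb agent <entity>/<project>/<sweep_id>` from a shell.
import wandb

sweep_config = {
    "program": "train.py",  # the script the command-line agent executes for each trial
    "method": "bayes",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min": 1, "max": 5},
        "learning_rate": {"min": 0, "max": 0.0001},
    },
}

sweep_id = wandb.sweep(sweep_config, project="Political Tweet Classifier")
print(sweep_id)  # pass this id to `wandb agent`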

@smith-nathanh

Hi @vanpelt, is there another solution besides running the script from the command line? I'm using Google Colab for GPU access, which uses notebooks, and as I understand it that can cause some issues with wandb sweeps.

@vanpelt
Contributor

vanpelt commented Jan 3, 2021

Hey @nhsmith85, we'll look into the root cause for in-process sweeps, but you should be able to run a script from Colab pretty easily. You can use the %%writefile train.py magic to write a file, and then call !wandb agent SWEEP_ID from a different cell.
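
For instance, something along these lines (a rough sketch; the two snippets go in separate cells, with the magic on the first line of its cell, and SWEEP_ID standing in for whatever wandb.sweep returned):

%%writefile train.py
# everything in this cell below the magic gets written to train.py
import wandb

def train():
    with wandb.init():
        config = wandb.config
        ...  # build and train the model here, reading hyperparameters from config

if __name__ == "__main__":
    train()

and then, in a separate cell:

!wandb agent SWEEP_ID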

@github-actions

github-actions bot commented Mar 5, 2021

This issue is stale because it has been open 60 days with no activity.

github-actions bot added the stale label Mar 5, 2021
@joawar

joawar commented Mar 24, 2021

I'm having the same issue: huggingface/transformers#10885

@borisdayma
Contributor

@jwa018 With huggingface, you could try: os.environ['WANDB_WATCH'] = 'false'

Otherwise, if you're doing a parameter search, you should check out the sweeps documentation, as you can structure your code so that each run is an independent process (not sure if that's what you do already).
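
For example, a minimal sketch of the first suggestion (set it before training starts so the integration picks it up; WANDB_WATCH only disables gradient/parameter logging, the rest of the integration keeps working):

import os

# Tell the Hugging Face / wandb integration not to watch gradients and parameters.
os.environ["WANDB_WATCH"] = "false"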

@goncalomcorreia

I'm running sweeps, and I notice that every so often one of the GPUs doesn't reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where nvidia-smi reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory.

This is also happening to me when I'm running sweeps. After the agent finishes the first job, the memory in the GPU is not freed and the next job doesn't run due to a CUDA out of memory error.

Did anyone find a fix?

@vanpelt
Contributor

vanpelt commented Nov 3, 2021

@goncalomcorreia running sweeps from within a single process using wandb.agent can cause these errors if you aren't manually freeing the objects that connect to your GPU in the train function you passed to wandb.agent. The simplest solution is to use our command-line agent. That way you run wandb agent from the command line and it executes a program you specify for each trial, thereby guaranteeing that all objects release GPU memory. Example documentation here: https://docs.wandb.ai/guides/sweeps/quickstart#4.-launch-agent-s

@goncalomcorreia

Hi! thanks for the help @vanpelt! I think this is what I'm doing. I'm running wandb agent goncalomcorreia/*project_name*/pctdctfe and this happens after the first trial.

In my sweep configuration I have:

command:
- python
- -m
- nmt.train
- --config
- nmt_config.yaml
- ${args}
early_terminate:
  eta: 3
  max_iter: 27
  s: 2
  type: hyperband
method: grid
metric:
  goal: maximize
  name: val_BLEU
parameters:
  model.init_args.prior_latent_size:
    values:
    - 128
    - 256
    - 512

@vanpelt
Contributor

vanpelt commented Nov 3, 2021

This either means your previous trial didn't exit cleanly, which would be really strange, or it means there are other processes on the box consuming GPU memory. When executing agents this way, we run the python command, wait for it to exit, then run a new python command. When python exits it should release all GPU memory, so something else must be around holding onto it. You could run nvidia-smi while this is happening to see the PIDs of the processes using GPU memory, then do a ps aux | grep $PID to see what's holding onto the RAM.

@goncalomcorreia

Before the process ends:

(3.9.7) goncalo@server:~$ ps aux | grep 2559215
goncalo  2559215 94.0 24.7 58570208 32577444 pts/8 Dl+ 00:32 499:48 /home/goncalo/.pyenv/versions/nmt/bin/python -m nmt.train --config nmt_config.yaml  --model.init_args.prior_latent_size 2048
goncalo  2905447  0.0  0.0   6432  1968 pts/19   S+   09:23   0:00 grep --color=auto 2559215

After the process finishes, nothing appears in nvidia-smi but the memory is being used as if the process was still running. ps aux | grep $PID now gives:

(3.9.7) goncalo@server:~$ ps aux | grep 2559215
goncalo  2926176  0.0  0.0   6432   676 pts/19   S+   12:46   0:00 grep --color=auto 2559215

@vanpelt
Contributor

vanpelt commented Nov 4, 2021

What does nvidia-smi say when the process isn't running?

@goncalomcorreia

exactly as @leopd described above:

nvidia-smi reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory.

@vanpelt
Contributor

vanpelt commented Nov 5, 2021

@goncalomcorreia what does it say about used memory? Can you just copy the output of nvidia-smi and share it here?

@aidanjdonohue

@goncalomcorreia bumping Chris' suggestion

@goncalomcorreia

here it is:

goncalo@server:~$ nvidia-smi
Wed Nov 10 08:02:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 37%   63C    P2   15W / 250W  |   1MiB / 11019MiB    |     0%      Default  |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:0B:00.0 Off |                  N/A |
| 35%   57C    P2    8W / 250W  |   1MiB / 11019MiB    |     0%      Default  |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 48%   84C    P2   15W / 250W  |   9012MiB / 11016MiB |     0%      Default  |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|                                                                             |
+-----------------------------------------------------------------------------+

Notice how in GPU 2 the memory was not released.

@aidanjdonohue

@goncalomcorreia Are you still experiencing these types of issues with our most recent version of the client library 0.12.7?

@Jacfger

Jacfger commented Nov 10, 2022

I'm actually still having this problem, and it's very similar to what happened before: nvidia-smi shows nothing in the process list, but the memory is used. In nvtop (https://github.com/Syllo/nvtop), however, a ghost process shows up and it's holding the memory, but it's not killable (killing it via the PID shown there doesn't do anything, and nvtop also fails to see the process command).

[Screenshot: nvtop output showing a ghost process holding GPU memory]

So in my experience, when this happens the Python process is actually still there. If I kill every matching process with kill $(ps aux | grep <my python script> | awk '{print $2}'), the memory usage goes away. Another option is fuser /dev/nvidia* -k, which kills every process using the GPU (but that also kills all other processes using the GPU, which is not very nice). Given that, would it make sense to say that wandb didn't clean up the child process correctly? The thing is, it doesn't happen every time (the same script was fine training for 1 epoch but not for 250 epochs; could something else, like a network problem affecting wandb, be making this happen?), yet it happens with several different scripts of mine, which is quite annoying. (Also, not using wandb sweep was totally fine.)

I know this thread is kind of old, but I don't really know how else to search for this issue, and this is the only one I found that was related to my case.

My current wandb version is 0.12.17.
[screenshot]

PS: I saw some comments about watching the model parameters (which I did use), so I'm trying it without the watch.

@h3x4g0ns

h3x4g0ns commented Mar 7, 2023

Albeit closed, I thought I would offer my two cents on this since I was wrestling with this issue for the past few days. It looks like freeing up the resources at the end of the training function does the trick.

import torch
import wandb

# main training loop
def train(config=None):
  with wandb.init(config=config):
    config = wandb.config

    ...  # build the model and run training here

  # cleanup: drop the reference to the model and release cached GPU memory
  del model
  torch.cuda.empty_cache()


# running sweep
wandb.agent(sweep_id, train)
wandb.finish()

That way you can still use the python approach without having to define the .yaml file.

@Jacfger

Jacfger commented Mar 7, 2023

Albeit closed, I thought I would offer my two cents on this since I was wrestling with this issue for the past few days. It looks like freeing up the resources at the end of the training function does the trick. […]

This doesn't help with the "memory leak" that occurs when I Ctrl+C to terminate the programs, though.
