
GPU memory leak while running sweeps #1247

Closed
leopd opened this issue Sep 18, 2020 · 32 comments

Labels
ty:bug type of the issue is a bug

Comments

@leopd

leopd commented Sep 18, 2020

wandb --version && python --version && uname

Weights and Biases version: 0.9.7
Python version: 3.7.9
Operating System: Ubuntu 18.04LTS

Description

I'm running sweeps, and I notice that every so often one of the GPUs doesn't reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where nvidia-smi reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory. I can only reclaim the memory by rebooting the machine. (I've read that sometimes nvidia-smi -r will fix this, but it has never let me reset the GPU that way, I think because X Windows is running on it.)

What I Did

This is not a great bug report, because I don't know how to reproduce it. I'm not even sure it has anything to do with wandb; it might just be some bug between CUDA and PyTorch. But I've seen it three or four times now, and only when running wandb sweeps. I've mostly been using hyperband early termination with my sweeps, and I sometimes kill jobs manually from the wandb web UI. So I suspect it has something to do with the way the agent kills the Python process that's using the GPU, and maybe it's not cleaning up properly.

@issue-label-bot

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.93.

issue-label-bot bot added the ty:bug label Sep 18, 2020
@tyomhak

tyomhak commented Sep 22, 2020

Thanks for reporting! We're looking into this.

@zoeyuchao

zoeyuchao commented Oct 15, 2020

Hello, I hit the same error while using the sweep module. The program cannot free the GPU memory by itself, so I have to clean up the GPU memory afterwards. It also seems I can't kill the program with Ctrl+C: wandb prints a log message saying "ctrl+c pressed" and keeps running as normal. If I press Ctrl+C twice, the program is killed but the GPU memory is leaked. Any solutions to this issue? Thanks a lot.

@tyomhak

tyomhak commented Oct 19, 2020

hey @zoeyuchao, is this happening to you on an older version of wandb, or the latest one?

@zoeyuchao

hey @zoeyuchao, is this happening to you on an older version of wandb, or the latest one?

I tried older versions and the latest one, and I still get the same issue.

@tyomhak

tyomhak commented Oct 21, 2020

I see, thanks for following up. We're looking into fixing this issue.

@Odrec

Odrec commented Nov 17, 2020

I am also having a similar problem. I tried it with older versions and now with 0.10.10, and I still have the same problem: the GPU runs out of memory on the 3rd or 4th run of the sweep.

@MLBurnham

MLBurnham commented Dec 6, 2020

I'm also having this issue. Roughly 10% of memory (1 GB) isn't being released between runs. I'm using version 0.10.12 on Ubuntu 20.04 with CUDA 11.0 to train a classification model with the Simple Transformers library. You can clearly see the leak across successive runs in the attached screenshot.
[Screenshot: GPU memory usage climbing across successive runs]

@ariG23498
Contributor

Hey @MLBurnham, thanks for the update.
Could you share a simple working script with us so that we can reproduce this issue?
Thanks in advance 😄

@reymondzzzz

I have the same problem. It happens after training and validation, before a new epoch. I don't log gradients or parameters, only metrics. The memory doesn't leak on every GPU; see the screenshot (1 epoch is roughly 24.5 hours). The second time it happened on the 4th epoch and I ran out of memory.
[Screenshot: GPU memory usage over time]
Using version 0.10.10.

@vanpelt
Contributor

vanpelt commented Dec 30, 2020

Hey @reymondzzzz, I'm assuming this is within the context of a sweep? If you're not logging gradients, I can't think of any wandb feature that would interact with GPU memory. In the context of a sweep it's possible that multiple processes end up trying to use the same GPU if things aren't configured properly. We would need more information about the environment and the exact behavior to understand what the root cause could be.
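
One quick way to rule that out (just a sketch; the device index is a placeholder) is to pin each agent's process to its own GPU before anything initializes CUDA:

import os

# Pin this process to a single GPU so concurrent sweep agents don't share a device.
# Set this before importing torch / initializing CUDA; use "1", "2", ... for the other agents.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"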

@MLBurnham

MLBurnham commented Dec 30, 2020

Hey @MLBurnham, thanks for the update.
Could you share a simple working script with us so that we can reproduce this issue?
Thanks in advance

Sorry for the slow reply. Here's the sweep script I ran. The memory leak only occurs when I run the sweep: when running it, I can do ~4 iterations of the model before I run out of memory. When doing a manual grid search with the out-of-the-box Simple Transformers library, I can run dozens of iterations of the model back to back.

import logging

import pandas as pd
import wandb
from simpletransformers.classification import ClassificationArgs, ClassificationModel
from sklearn.metrics import accuracy_score

# config
sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min": 1, "max": 5},
        "learning_rate": {"min": 0, "max": 0.00010000},
    },
    # "early_terminate": {"type": "hyperband", "min_iter": 6},
}

sweep_id = wandb.sweep(sweep_config, project="Political Tweet Classifier")

# logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# data import
df = pd.read_csv("Tweets/oct_sample.csv")
df = df[['text', 'political']].dropna()

# train test split
train = df.sample(frac = 0.8, random_state = 907)
test = df.drop(train.index).reset_index(drop=True)
train = train.reset_index(drop = True)
train_df = train[['text', 'political']]
eval_df = test[['text', 'political']]
# convert strings to lower
train_df['text'] = train_df['text'].str.lower()
eval_df['text'] = eval_df['text'].str.lower()

# args
model_args = ClassificationArgs()
model_args.evaluate_during_training = True
model_args.evaluate_during_training_silent = False
model_args.evaluate_during_training_steps = 100
model_args.learning_rate = 4e-4
model_args.manual_seed = 907
model_args.max_seq_length = 256
model_args.no_cache = True
model_args.no_save = True
model_args.num_train_epochs = 10
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.train_batch_size = 16
model_args.eval_batch_size = 16
model_args.train_custom_parameters_only = False
model_args.wandb_project = "Political Tweet Classifier"

# training function
def train():
    # Initialize a new wandb run
    wandb.init(resume = True)

    # Create a TransformerModel
    model = ClassificationModel(
        "electra",
        "google/electra-base-discriminator",
        use_cuda=True,
        args=model_args,
        sweep_config=wandb.config,
    )

    # Train the model
    model.train_model(
        train_df,
        eval_df=eval_df,
        accuracy=lambda truth, predictions: accuracy_score(
            truth, [round(p) for p in predictions]
        ),
    )

    # Sync wandb
    wandb.join()

# train
wandb.agent('ab5cvrjn', train)

@vanpelt
Contributor

vanpelt commented Dec 30, 2020

Ahh, my guess is that something in the TransformerModel is holding onto GPU memory between iterations. If you can refactor the code to use process-based execution, that should address the issue. Assuming you write the above code to a file named train.py, something like the following should work:

# Add the following to your sweep config
# program: "train.py"

# Add the following to the end of your script
if __name__ == "__main__":
    train()

Then, instead of calling wandb.agent(...) inside Python, call it from the command line: wandb agent SWEEP_ID. This will launch each trial in a separate process, which should guarantee you don't hold onto any memory.
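
For reference, a rough sketch (untested here) of what the sweep definition from the script above might look like with the program key added, so the command-line agent knows which script to launch for each trial:

# Sketch: define the sweep once, then run `wandb agent <entity>/<project>/<sweep_id>` from a shell.
import wandb

sweep_config = {
    "program": "train.py",  # the script the command-line agent executes for each trial
    "method": "bayes",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min": 1, "max": 5},
        "learning_rate": {"min": 0, "max": 0.0001},
    },
}

sweep_id = wandb.sweep(sweep_config, project="Political Tweet Classifier")
print(sweep_id)  # pass this id to `wandb agent`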

@smith-nathanh

Hi @vanpelt, is there another solution besides running the script from the command line? I'm using Google Colab for GPU access, which uses notebooks, and as I understand it that can cause some issues with wandb sweeps.

@vanpelt
Contributor

vanpelt commented Jan 3, 2021

Hey @nhsmith85, we'll look into the root cause for in-process sweeps, but you should be able to run a script from Colab pretty easily. You can use the %%writefile train.py magic to write a file, and then call !wandb agent SWEEP_ID from a different cell.
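
For instance, something along these lines (a rough sketch; the two snippets go in separate cells, with the magic on the first line of its cell, and SWEEP_ID standing in for whatever wandb.sweep returned):

%%writefile train.py
# everything in this cell below the magic gets written to train.py
import wandb

def train():
    with wandb.init():
        config = wandb.config
        ...  # build and train the model here, reading hyperparameters from config

if __name__ == "__main__":
    train()

and then, in a separate cell:

!wandb agent SWEEP_ID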

@github-actions

github-actions bot commented Mar 5, 2021

This issue is stale because it has been open 60 days with no activity.

github-actions bot added the stale label Mar 5, 2021
@joawar

joawar commented Mar 24, 2021

I'm having the same issue: huggingface/transformers#10885

@borisdayma
Contributor

@jwa018 With huggingface, you could try: os.environ['WANDB_WATCH'] = 'false'

Otherwise, if you're doing a parameter search, you should check out the sweeps documentation, as you can structure your code so that each run is an independent process (not sure if that's what you do already).
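
For example, a minimal sketch of the first suggestion (set it before training starts so the integration picks it up; WANDB_WATCH only disables gradient/parameter logging, the rest of the integration keeps working):

import os

# Tell the Hugging Face / wandb integration not to watch gradients and parameters.
os.environ["WANDB_WATCH"] = "false"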

@goncalomcorreia

I'm running sweeps, and I notice that every so often one of the GPUs doesn't reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where nvidia-smi reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory.

This is also happening to me when I'm running sweeps. After the agent finishes the first job, the memory in the GPU is not freed and the next job doesn't run due to a CUDA out of memory error.

Did anyone find a fix?

@vanpelt
Contributor

vanpelt commented Nov 3, 2021

@goncalomcorreia running sweeps from within a single process using wandb.agent can cause these errors if you aren't manually freeing the objects that connect to your GPU in the train function you passed to wandb.agent. The simplest solution is to use our command-line agent. That way you run wandb agent from the command line and it executes a program you specify for each trial, thereby guaranteeing that all objects release GPU memory. Example documentation here: https://docs.wandb.ai/guides/sweeps/quickstart#4.-launch-agent-s

@goncalomcorreia

Hi! thanks for the help @vanpelt! I think this is what I'm doing. I'm running wandb agent goncalomcorreia/*project_name*/pctdctfe and this happens after the first trial.

In my sweep configuration I have:

command:
- python
- -m
- nmt.train
- --config
- nmt_config.yaml
- ${args}
early_terminate:
  eta: 3
  max_iter: 27
  s: 2
  type: hyperband
method: grid
metric:
  goal: maximize
  name: val_BLEU
parameters:
  model.init_args.prior_latent_size:
    values:
    - 128
    - 256
    - 512

@vanpelt
Contributor

vanpelt commented Nov 3, 2021

This either means your previous trial didn't exit cleanly, which would be really strange, or it means there are other processes on the box consuming GPU memory. When executing agents this way, we run the python command, wait for it to exit, then run a new python command. When python exits it should release all GPU memory, so something else must be around holding onto it. You could run nvidia-smi while this is happening to see the PIDs of the processes using GPU memory, then do a ps aux | grep $PID to see what's holding onto the RAM.

@goncalomcorreia

Before the process ends:

(3.9.7) goncalo@server:~$ ps aux | grep 2559215
goncalo  2559215 94.0 24.7 58570208 32577444 pts/8 Dl+ 00:32 499:48 /home/goncalo/.pyenv/versions/nmt/bin/python -m nmt.train --config nmt_config.yaml  --model.init_args.prior_latent_size 2048
goncalo  2905447  0.0  0.0   6432  1968 pts/19   S+   09:23   0:00 grep --color=auto 2559215

After the process finishes, nothing appears in nvidia-smi but the memory is being used as if the process was still running. ps aux | grep $PID now gives:

(3.9.7) goncalo@server:~$ ps aux | grep 2559215
goncalo  2926176  0.0  0.0   6432   676 pts/19   S+   12:46   0:00 grep --color=auto 2559215

@vanpelt
Contributor

vanpelt commented Nov 4, 2021

What does nvidia-smi say when the process isn't running?

@goncalomcorreia

exactly as @leopd described above:

nvidia-smi reports that the memory is used in the top half, but the bottom half doesn't report any process that owns that memory.

@vanpelt
Contributor

vanpelt commented Nov 5, 2021

@goncalomcorreia what does it say about used memory? Can you just copy the output of nvidia-smi and share it here?

@aidanjdonohue

@goncalomcorreia bumping Chris' suggestion

@goncalomcorreia

here it is:

goncalo@server:~$ nvidia-smi
Wed Nov 10 08:02:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 37%   63C    P2   15W / 250W  |   1MiB / 11019MiB    |     0%      Default  |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:0B:00.0 Off |                  N/A |
| 35%   57C    P2    8W / 250W  |   1MiB / 11019MiB    |     0%      Default  |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 48%   84C    P2   15W / 250W  |   9012MiB / 11016MiB |     0%      Default  |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|                                                                             |
+-----------------------------------------------------------------------------+

Notice how in GPU 2 the memory was not released.

@aidanjdonohue

@goncalomcorreia Are you still experiencing these types of issues with our most recent version of the client library 0.12.7?

@Jacfger

Jacfger commented Nov 10, 2022

I'm actually still having this problem, and it's very similar to what happened before: nvidia-smi shows nothing in the process list, but the memory is used. In nvtop (https://github.com/Syllo/nvtop), however, a ghost process shows up and it's holding the memory, but it's not killable (killing it via the PID shown there doesn't do anything, and nvtop also fails to see the process command).

[Screenshot: nvtop output showing a ghost process holding GPU memory]

So in my experience, when this happens the Python process is actually still there. If I kill every matching process with kill $(ps aux | grep <my python script> | awk '{print $2}'), the memory usage goes away. Another option is fuser /dev/nvidia* -k, which kills every process using the GPU (but that also kills all other processes using the GPU, which is not very nice). Given that, would it make sense to say that wandb didn't clean up the child process correctly? The thing is, it doesn't happen every time (the same script was fine training for 1 epoch but not for 250 epochs; could something else, like a network problem affecting wandb, be making this happen?), yet it happens with several different scripts of mine, which is quite annoying. (Also, not using wandb sweep was totally fine.)

I know this thread is kind of old, but I don't really know how else to search for this issue, and this is the only one I found that was related to my case.

My current wandb version is 0.12.17.
[screenshot]

PS: I saw some comments about watching the model parameters (which I did use), so I'm trying it without the watch.

@h3x4g0ns

h3x4g0ns commented Mar 7, 2023

Albeit closed, I thought I would offer my two cents on this since I was wrestling with this issue for the past few days. It looks like freeing up the resources at the end of the training function does the trick.

import torch
import wandb

# main training loop
def train(config=None):
  with wandb.init(config=config):
    config = wandb.config

    ...  # build the model and run training here

  # cleanup: drop the reference to the model and release cached GPU memory
  del model
  torch.cuda.empty_cache()


# running sweep
wandb.agent(sweep_id, train)
wandb.finish()

That way you can still use the python approach without having to define the .yaml file.

@Jacfger

Jacfger commented Mar 7, 2023

Albeit closed, I thought I would offer my two cents on this since I was wrestling with this issue for the past few days. It looks like freeing up the resources at the end of the training function does the trick. […]

This doesn't help with the "memory leak" that occurs when I Ctrl+C to terminate the programs, though.
