CUDA out of memory when load model #72

Closed
Andy1621 opened this issue Jan 8, 2020 · 21 comments

@Andy1621

Andy1621 commented Jan 8, 2020

I trained mobilenetv3_large_100 on 8 2080Ti GPUs with a batch size of 128 per GPU, i.e. 128 * 8 = 1024 images per batch. When I resumed the model, I got a "CUDA out of memory" error. However, when I trained it again from scratch, there was no error.
I noticed that your code in "helper.py" loads the checkpoint on CPU, which should prevent this bug, so why does it happen?
checkpoint = torch.load(checkpoint_path, map_location='cpu')

Another interesting problem: the acc@1 is very low in the first few epochs (close to random chance), and the eval loss even rises. Why?

@rwightman
Collaborator

rwightman commented Jan 8, 2020

The low initial accuracy is likely due to the warmup epochs, where the LR increases from a small value up to the normal learning rate before the LR schedule takes over. Set --warmup-epochs 0 to disable this. There is also some lag at the start in the EMA weights if you have a decay constant close to 1. Some optimizers behave better than others during this phase: SGD is usually well behaved, while Adam and some of the adaptive optimizers don't like it much, so it may be better to disable warmup with those. The default PyTorch rmsprop can be unstable here, but my rmsproptf variant usually handles it okay.
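
For illustration, a minimal sketch of what a linear warmup looks like (this is not timm's actual scheduler; the names and defaults here are made up):

def warmup_lr(epoch, base_lr=0.1, warmup_lr_init=1e-4, warmup_epochs=3):
    """Ramp the LR linearly from warmup_lr_init up to base_lr, then hand
    off to the normal schedule (cosine, step, ...)."""
    if epoch < warmup_epochs:
        # while the LR is still tiny, top-1 accuracy can look close to random
        t = (epoch + 1) / warmup_epochs
        return warmup_lr_init + t * (base_lr - warmup_lr_init)
    return base_lr  # the main LR schedule takes over from here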

I've seen the resume memory issue once or twice before when running close to the GPU memory limit. I'm not quite sure why it happens. It may be partially related to a reordering of the optimizer resume I did some time ago (see the commit below) to make the AMP resume correct. The interplay and ordering between Apex AMP, cuda(), DP, DDP, ModelEma and resuming is a bit complicated.

3d9c8a6

One thing to try would be to remove the AMP state resume (I don't know if it is 100% necessary), move the checkpoint load back before model.cuda(), and put the optimizer restore back before amp.init(), as it was in that old commit.

@Andy1621
Author

Andy1621 commented Jan 9, 2020

Thanks for your patience. It seems difficult for me to fix the error right now, since I don't know AMP and it's hard to remove because of code dependencies. Actually, the apex package isn't in my environment; I will install it and see whether mixed precision training helps increase the accuracy. Just as you said, the error happened when running close to the GPU memory limit. Resuming works fine when the batch size is 64 * 8 = 512, so the easy solution for me is simply to use a smaller batch.

There is an interesting phenomenon when I decrease the batch size: the GPU memory grows every batch and then drops when it gets close to the limit. Do you know why this happens? I suspect there is some GPU memory allocation strategy in PyTorch that I'm not aware of.

@rwightman
Collaborator

rwightman commented Jan 9, 2020

@Andy1621 mixed precision won't increase the accuracy, but on Volta or Turing it will roughly double the training speed and halve the memory usage for these networks...

When you see the GPU memory oscillating like that, it's probably cuDNN trying to find an optimal conv algorithm (benchmark mode)... that behaviour should go away if you change the line torch.backends.cudnn.benchmark = True in train.py to False. That can reduce performance (speed).
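
For reference, the toggle being described (a sketch of the change, not a patch):

import torch

# Disabling cuDNN autotuning trades some speed for stable memory use,
# which avoids the per-batch memory oscillation described above.
torch.backends.cudnn.benchmark = False  # train.py sets this to True by default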

I usually decrease the batch size a few notches if that happens. No need to go all the way from 128 to 64... generally other multiples of 8 or 16 shouldn't reduce performance, so maybe try 112, 96, etc. Or install Apex, enable AMP, and you can increase the batch size.

@Andy1621
Author

Andy1621 commented Jan 9, 2020

Thank you very much! It looks like it's already midnight in your city, good night~~

@pichuang1984

I am also seeing a similar problem, where training from scratch works but resuming results in CUDA out of memory, so I am also dialing down the batch size in multiples of 8.

@rwightman
Collaborator

rwightman commented Jan 14, 2020

@pichuang1984 as mentioned above, you can try reorganizing the load sequence if it's a big problem (rough sketch below):

  • load model before cuda()
  • load optimizer before amp init
  • comment out amp restore

I was thinking of leaving a path in there that was technically incorrect with respect to the Apex AMP recommendations (like it used to be earlier in 2019) as it seems to behave better for the memory use on resume... but then it'd all get a bit messy :)
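
For anyone who wants to try it, a rough sketch of that reordering, assuming Apex AMP and a checkpoint dict with 'state_dict', 'optimizer' and 'amp' entries (the exact keys and surrounding code in train.py may differ):

import torch
from apex import amp

# model, optimizer and checkpoint_path are assumed to already exist as in train.py
checkpoint = torch.load(checkpoint_path, map_location='cpu')

# 1. load the model weights before moving the model to GPU
model.load_state_dict(checkpoint['state_dict'])
model.cuda()

# 2. restore the optimizer before amp.initialize() (the old ordering, which goes
#    against the official Apex recommendation)
optimizer.load_state_dict(checkpoint['optimizer'])
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# 3. skip restoring the AMP (loss scaler) state entirely
# amp.load_state_dict(checkpoint['amp'])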

@andravin
Contributor

andravin commented Feb 12, 2020

I also see the out of memory error after resume, with a model that normally uses 12GB of GPU memory out of the 15.75GB capacity.

If I cut the batch size in half, then resume succeeds; but during the first epoch, GPU 0 uses almost twice as much memory as the rest, 13.5 GB versus 6.9GB.

Also, not sure how to interpret the error message, but it either indicates severe memory fragmentation or a lot of unaccounted memory:

RuntimeError: CUDA out of memory. Tried to allocate 232.00 MiB (GPU 0; 15.75 GiB total capacity; 7.63 GiB already allocated; 186.88 MiB free; 220.04 MiB cached)

Calling torch.cuda.empty_cache() before training has no effect. The functions that report memory manager details do not appear to be available in PyTorch 1.3.1.
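
For reference, a sketch of the basic per-device allocator counters (memory_allocated()/memory_cached() exist even on 1.3.x; memory_reserved() and memory_summary() arrived in 1.4):

import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024 ** 3
    reserved = torch.cuda.memory_reserved(i) / 1024 ** 3  # memory_cached(i) on <= 1.3
    print('cuda:{}: {:.2f} GiB allocated, {:.2f} GiB reserved by the caching allocator'
          .format(i, alloc, reserved))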

I tried the recommended changes to train.py. There was no change in behavior with distributed_train.sh, but single-GPU training with train.py worked. EC2 killed my instance before I had a chance to save the patch, or to see whether single-GPU training also worked without the patch.

If I cut batch size in half during resume, everything appears to be OK, except GPU 0 shows twice as much memory used in nvidia-smi.

@andravin
Contributor

andravin commented Feb 12, 2020

Upgrading to PyTorch 1.4.0 and reinstalling Apex had no effect (edit: with the original train.py code; I have not tried the recommended fix with this setup).

@rwightman
Collaborator

@andravin I've tried monkeying around with this. I've never seen it quite as bad as you describe (2x), but I can definitely reproduce it by resuming AMP-trained models that are close to the memory limit. I wonder if the number of distributed nodes has an impact?

I believe I followed the recommendations in the Apex AMP repo correctly with respect to the sequence of model / optimizer / amp init, etc. I have no idea what is going on there and suspect (as with #80) it's beyond my control. I tried shuffling things around (going against the recommendations) and tried explicitly moving some optimizer state to specific GPUs. Sometimes it looked like an improvement, but then another run would crash... so I don't have a reliable workaround.

@andravin
Contributor

I just narrowed down the memory usage explosion on device 0 to the ModelEma resume. The memory manager never seems to recover from this.

@rwightman
Collaborator

Are there any obvious to(device)/cuda() mistakes in the EMA resume sequence?

@rwightman
Collaborator

Something to try: I'm not mapping the device for the load here as I do for the normal checkpoint load, to either CPU or explicitly to the exact GPU: https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils.py#L262
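
For example, either of these mappings for the EMA load (a sketch, not the final change; the device string is illustrative, and checkpoint_path is assumed as in ModelEma._load_checkpoint):

import torch

# map every storage onto the CPU first...
checkpoint = torch.load(checkpoint_path, map_location='cpu')

# ...or map storages straight onto a specific GPU (each rank would use its own device)
checkpoint = torch.load(checkpoint_path, map_location='cuda:0')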

@andravin
Contributor

It must be something like that: ModelEma._load_checkpoint is called 8 times, once per device, yet the memory increase only happens on device 0.

@andravin
Contributor

From the torch.load() docs: "torch.load() uses Python’s unpickling facilities but treats storages, which underlie tensors, specially. They are first deserialized on the CPU and are then moved to the device they were saved from."

@andravin
Contributor

Sure enough, this fixes it:

diff --git a/timm/utils.py b/timm/utils.py
index 59d2bcd..1da69a9 100644
--- a/timm/utils.py
+++ b/timm/utils.py
@@ -259,7 +259,7 @@ class ModelEma:
             p.requires_grad_(False)
 
     def _load_checkpoint(self, checkpoint_path):
-        checkpoint = torch.load(checkpoint_path)
+        checkpoint = torch.load(checkpoint_path, map_location='cpu')
         assert isinstance(checkpoint, dict)
         if 'state_dict_ema' in checkpoint:
             new_state_dict = OrderedDict()

@andravin
Contributor

andravin commented Feb 12, 2020

What I want to know is: why wasn't the PyTorch memory manager able to collect this ~8 GB of GPU memory after the checkpoint local variable went out of scope in ModelEma._load_checkpoint?

Also, the out-of-memory error, which did not occur until the first training step, seemed to indicate that there was ~8 GB of used GPU memory that was unseen by the PyTorch memory manager.

@andravin
Contributor

I just noticed that @Andy1621 was also using 8 GPUs in the original bug report. I could not tell if he was also using EMA. That would be consistent with the hypothesis that this bug is more severe with more devices, which is exactly what we saw with the EMA weights resume before the fix.

Since the fix, the issue is completely solved for me. I do not see extra GPU memory use on device #0 anymore, and resume never crashes. I suspect the patch also fixed the original issue reported by @Andy1621, but it would be good to get confirmation from him.

@rwightman
Collaborator

@pichuang1984 was also using 8+ GPUs, so I'm curious whether the change fixed it for him as well.

@rwightman
Collaborator

Closing this, since it's fixed for @andravin with 8 GPUs and I am no longer noticing the usual smaller memory spike on 2-GPU resumes.

@aegonwolf

I am seeing the same error. I don't quite understand what you mean by "load model before cuda()"?

@andravin
Contributor

@aegonwolf this issue is ancient history, long since resolved. Are you sure you are in the right place?

It might be best to create a new issue with details about the problem you are seeing and how to reproduce it, then reference this issue in the new report if it still seems relevant.
