CUDA out of memory when load model #72

Closed
Andy1621 opened this issue Jan 8, 2020 · 21 comments

@Andy1621

Andy1621 commented Jan 8, 2020

I trained mobilenetv3_large_100 on 8 2080Ti GPUs with a batch size of 128 per GPU, i.e. 128 * 8 = 1024 images per batch. When I resumed the model, I got a "CUDA out of memory" error. However, when I trained it again from scratch, there was no error.
I noticed that your code in "helper.py" loads the checkpoint on CPU, which should prevent this bug, so why does it happen?
checkpoint = torch.load(checkpoint_path, map_location='cpu')

Another interesting problem: the acc@1 is very low in the first few epochs (close to random chance), and the eval loss even rises. Why?

@rwightman
Collaborator

rwightman commented Jan 8, 2020

The low initial accuracy is likely due to the warmup epochs, where the LR increases from a small value up to the normal learning rate before the LR schedule takes over. Set --warmup-epochs 0 to disable this. There is also some lag at the start in the EMA weights if you have a decay constant close to 1. Some optimizers behave better than others during this phase: SGD is usually well behaved, while Adam and some of the adaptive optimizers don't like it much, so it may be better to disable warmup with those. The default PyTorch rmsprop can be unstable here, but my rmsproptf variant usually handles it okay.
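
For illustration, a minimal sketch of what a linear warmup looks like (this is not timm's actual scheduler; the names and defaults here are made up):

def warmup_lr(epoch, base_lr=0.1, warmup_lr_init=1e-4, warmup_epochs=3):
    """Ramp the LR linearly from warmup_lr_init up to base_lr, then hand
    off to the normal schedule (cosine, step, ...)."""
    if epoch < warmup_epochs:
        # while the LR is still tiny, top-1 accuracy can look close to random
        t = (epoch + 1) / warmup_epochs
        return warmup_lr_init + t * (base_lr - warmup_lr_init)
    return base_lr  # the main LR schedule takes over from here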

I've seen the resume memory issue once or twice before when running close to the GPU memory limit. I'm not quite sure why it happens. It may be partially related to a reordering of the optimizer resume I did some time ago (see the commit below) to make the AMP resume correct. The interplay and ordering between Apex AMP, cuda(), DP, DDP, ModelEma and resuming is a bit complicated.

3d9c8a6

One thing to try would be to remove the AMP state resume (I don't know if it is 100% necessary), move the checkpoint load back before model.cuda(), and put the optimizer restore back before amp.init(), as it was in that old commit.

@Andy1621
Author

Andy1621 commented Jan 9, 2020

Thanks for your patience. It seems difficult for me to fix the error right now, since I don't know AMP and it's hard to remove because of code dependencies. Actually, the apex package isn't in my environment; I will install it and see whether mixed precision training helps increase the accuracy. Just as you said, the error happened when running close to the GPU memory limit. Resuming works fine when the batch size is 64 * 8 = 512, so the easy solution for me is simply to use a smaller batch.

There is an interesting phenomenon when I decrease the batch size: the GPU memory grows every batch and then drops when it gets close to the limit. Do you know why this happens? I suspect there is some GPU memory allocation strategy in PyTorch that I'm not aware of.

@rwightman
Collaborator

rwightman commented Jan 9, 2020

@Andy1621 mixed precision won't increase the accuracy, but on Volta or Turing it will roughly double the training speed and halve the memory usage for these networks...

When you see the GPU memory oscillating like that, it's probably cuDNN trying to find an optimal conv algorithm (benchmark mode)... that behaviour should go away if you change the line torch.backends.cudnn.benchmark = True in train.py to False. That can reduce performance (speed).
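
For reference, the toggle being described (a sketch of the change, not a patch):

import torch

# Disabling cuDNN autotuning trades some speed for stable memory use,
# which avoids the per-batch memory oscillation described above.
torch.backends.cudnn.benchmark = False  # train.py sets this to True by default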

I usually decrease the batch size a few notches if that happens. No need to go all the way from 128 to 64... generally other multiples of 8 or 16 shouldn't reduce performance, so maybe try 112, 96, etc. Or install Apex, enable AMP, and you can increase the batch size.

@Andy1621
Author

Andy1621 commented Jan 9, 2020

Thank you very much! It looks like it's already midnight in your city, good night~~

@pichuang1984

I am also seeing a similar problem, where training from scratch works but resuming results in CUDA out of memory, so I am also dialing down the batch size in multiples of 8.

@rwightman
Collaborator

rwightman commented Jan 14, 2020

@pichuang1984 as mentioned above, you can try reorganizing the load sequence if it's a big problem (rough sketch below):

  • load model before cuda()
  • load optimizer before amp init
  • comment out amp restore

I was thinking of leaving a path in there that was technically incorrect with respect to the Apex AMP recommendations (like it used to be earlier in 2019) as it seems to behave better for the memory use on resume... but then it'd all get a bit messy :)
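
For anyone who wants to try it, a rough sketch of that reordering, assuming Apex AMP and a checkpoint dict with 'state_dict', 'optimizer' and 'amp' entries (the exact keys and surrounding code in train.py may differ):

import torch
from apex import amp

# model, optimizer and checkpoint_path are assumed to already exist as in train.py
checkpoint = torch.load(checkpoint_path, map_location='cpu')

# 1. load the model weights before moving the model to GPU
model.load_state_dict(checkpoint['state_dict'])
model.cuda()

# 2. restore the optimizer before amp.initialize() (the old ordering, which goes
#    against the official Apex recommendation)
optimizer.load_state_dict(checkpoint['optimizer'])
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# 3. skip restoring the AMP (loss scaler) state entirely
# amp.load_state_dict(checkpoint['amp'])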

@andravin
Contributor

andravin commented Feb 12, 2020

I also see the out of memory error after resume, with a model that normally uses 12GB of GPU memory out of the 15.75GB capacity.

If I cut the batch size in half, then resume succeeds; but during the first epoch, GPU 0 uses almost twice as much memory as the rest, 13.5 GB versus 6.9GB.

Also, not sure how to interpret the error message, but it either indicates severe memory fragmentation or a lot of unaccounted memory:

RuntimeError: CUDA out of memory. Tried to allocate 232.00 MiB (GPU 0; 15.75 GiB total capacity; 7.63 GiB already allocated; 186.88 MiB free; 220.04 MiB cached)

Calling torch.cuda.empty_cache() before training has no effect. The functions that report memory manager details do not appear to be available in PyTorch 1.3.1.
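
For reference, a sketch of the basic per-device allocator counters (memory_allocated()/memory_cached() exist even on 1.3.x; memory_reserved() and memory_summary() arrived in 1.4):

import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024 ** 3
    reserved = torch.cuda.memory_reserved(i) / 1024 ** 3  # memory_cached(i) on <= 1.3
    print('cuda:{}: {:.2f} GiB allocated, {:.2f} GiB reserved by the caching allocator'
          .format(i, alloc, reserved))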

I tried the recommended changes to train.py. There was no change in behavior with distributed_train.sh, but single-GPU training with train.py worked. EC2 killed my instance before I had a chance to save the patch, or to see whether single-GPU training also worked without the patch.

If I cut batch size in half during resume, everything appears to be OK, except GPU 0 shows twice as much memory used in nvidia-smi.

@andravin
Contributor

andravin commented Feb 12, 2020

Upgrading to PyTorch 1.4.0 and reinstalling Apex had no effect (edit: with the original train.py code; I have not tried the recommended fix with this setup).

@rwightman
Collaborator

@andravin I've tried monkeying around with this. I've never seen it quite as bad as you describe (2x), but I can definitely reproduce it by resuming AMP-trained models that are close to the memory limit. I wonder if the number of distributed nodes has an impact?

I believe I followed the recommendations in the Apex AMP repo correctly with respect to the sequence of model / optimizer / amp init, etc. I have no idea what is going on there and suspect (as with #80) it's beyond my control. I tried shuffling things around (going against the recommendations) and tried explicitly moving some optimizer state to specific GPUs. Sometimes it looked like an improvement, but then another run would crash... so I don't have a reliable workaround.

@andravin
Contributor

I just narrowed down the memory usage explosion on device 0 to the ModelEma resume. The memory manager never seems to recover from this.

@rwightman
Collaborator

Are there any obvious to(device)/cuda() mistakes in the EMA resume sequence?

@rwightman
Collaborator

Something to try: I'm not mapping the device for the load here as I do for the normal checkpoint load, to either CPU or explicitly to the exact GPU: https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils.py#L262
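
For example, either of these mappings for the EMA load (a sketch, not the final change; the device string is illustrative, and checkpoint_path is assumed as in ModelEma._load_checkpoint):

import torch

# map every storage onto the CPU first...
checkpoint = torch.load(checkpoint_path, map_location='cpu')

# ...or map storages straight onto a specific GPU (each rank would use its own device)
checkpoint = torch.load(checkpoint_path, map_location='cuda:0')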

@andravin
Contributor

It must be something like that: ModelEma._load_checkpoint is called 8 times, once per device, yet the memory increase only happens on device 0.

@andravin
Contributor

From the torch.load() docs: "torch.load() uses Python’s unpickling facilities but treats storages, which underlie tensors, specially. They are first deserialized on the CPU and are then moved to the device they were saved from."

@andravin
Contributor

Sure enough, this fixes it:

diff --git a/timm/utils.py b/timm/utils.py
index 59d2bcd..1da69a9 100644
--- a/timm/utils.py
+++ b/timm/utils.py
@@ -259,7 +259,7 @@ class ModelEma:
             p.requires_grad_(False)
 
     def _load_checkpoint(self, checkpoint_path):
-        checkpoint = torch.load(checkpoint_path)
+        checkpoint = torch.load(checkpoint_path, map_location='cpu')
         assert isinstance(checkpoint, dict)
         if 'state_dict_ema' in checkpoint:
             new_state_dict = OrderedDict()

@andravin
Contributor

andravin commented Feb 12, 2020

What I want to know is: why wasn't the PyTorch memory manager able to collect this ~8 GB of GPU memory after the checkpoint local variable went out of scope in ModelEma._load_checkpoint?

Also, the out-of-memory error, which did not occur until the first training step, seemed to indicate that there was ~8 GB of used GPU memory that was unseen by the PyTorch memory manager.

@andravin
Contributor

I just noticed that @Andy1621 was also using 8 GPUs in the original bug report. I could not tell if he was also using EMA. That would be consistent with the hypothesis that this bug is more severe with more devices, which is exactly what we saw with the EMA weights resume before the fix.

Since the fix, the issue is completely solved for me. I do not see extra GPU memory use on device #0 anymore, and resume never crashes. I suspect the patch also fixed the original issue reported by @Andy1621, but it would be good to get confirmation from him.

@rwightman
Collaborator

@pichuang1984 was also using 8+ GPUs, so I'm curious whether the change fixed it for him as well.

@rwightman
Collaborator

Closing this, since it's fixed for @andravin with 8 GPUs and I am no longer noticing the usual smaller memory spike on 2-GPU resumes.

@aegonwolf

I am seeing the same error. I don't quite understand what you mean by "load model before cuda()"?

@andravin
Contributor

@aegonwolf this issue is ancient history, long since resolved. Are you sure you are in the right place?

It might be best to create a new issue with details about the problem you are seeing and how to reproduce it, then reference this issue in the new report if it still seems relevant.
