CUDA out of memory when loading a model #72
Comments
The low initial accuracy is likely due to the warmup epochs, where the LR increases from a small value up to the normal learning rate before following the LR schedule. Set the warmup epochs to 0 if you don't want that behaviour.

The resume memory issue I've seen once or twice before when running close to the GPU memory limit. I'm not quite sure why it happens. It may be partially related to a reordering of the optimizer resume I did some time ago (see commit below) to make the AMP resume correct. The interplay and ordering between Apex AMP, cuda(), DP, DDP, ModelEMA and resuming is a bit complicated. One thing to try would be to remove the AMP state resume (I don't know if it is 100% necessary), move the resume back before the model.cuda(), and put the optimizer restore back before amp.init() as it was in that old commit.
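To make that last suggestion concrete, here is a rough sketch of the alternative ordering, assuming a typical Apex AMP setup and checkpoint keys like 'state_dict', 'optimizer', and 'amp' (all assumptions for illustration, not the repo's actual train script):

```python
import torch
from apex import amp  # NVIDIA Apex, assumed to be installed

# model, optimizer and checkpoint_path are assumed to exist already
checkpoint = torch.load(checkpoint_path, map_location='cpu')

# 1) restore model weights while everything is still on CPU, then move to GPU
model.load_state_dict(checkpoint['state_dict'])
model.cuda()

# 2) restore the optimizer state *before* amp.initialize(), as in the older commit
optimizer.load_state_dict(checkpoint['optimizer'])

# 3) initialize AMP last; the AMP state resume can be skipped to test the theory
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
# amp.load_state_dict(checkpoint['amp'])  # <- the AMP state resume in question
```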
@Andy1621 mixed precision won't increase the accuracy, but on Volta or Turing it will roughly double the training speed and halve the memory usage for these networks... When you see the GPU memory oscillating like that, it's probably trying to find an optimal memory layout (conv algo selection); that behaviour should go away if you change the cudnn benchmark line. I usually decrease the batch size a few notches if that happens. No need to go all the way from 128 to 64... generally other multiples of 8 or 16 shouldn't reduce performance, so maybe try 112, 96, etc. Or install Apex, enable AMP, and you can increase the batch size.
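The conv algorithm search mentioned here is controlled by a standard PyTorch flag; a minimal sketch of turning it off, assuming the training script enables benchmark mode (which is what triggers the search):

```python
import torch

# benchmark=True lets cudnn try several convolution algorithms and keep the fastest,
# which shows up as memory use bouncing around during the first training steps.
# Setting it to False trades a little speed for stable, predictable memory use.
torch.backends.cudnn.benchmark = False
```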
Thank you very much! It looks like it's already midnight in your city, good night~~
I am also seeing a similar problem, where training from scratch works but resuming will result in CUDA out of memory, so I am also dialing down the batch size by multiple(s) of 8. |
@pichuang1984 as mentioned above, you can try reorganizing the load sequence if it's a big problem
I was thinking of leaving a path in there that was technically incorrect with respect to the Apex AMP recommendations (like it used to be earlier in 2019) as it seems to behave better for the memory use on resume... but then it'd all get a bit messy :)
I also see the out of memory error after resume, with a model that normally uses 12GB of GPU memory out of the 15.75GB capacity. If I cut the batch size in half, then resume succeeds; but during the first epoch, GPU 0 uses almost twice as much memory as the rest, 13.5GB versus 6.9GB. Also, I'm not sure how to interpret the error message, but it either indicates severe memory fragmentation or a lot of unaccounted memory.
I tried the recommended changes to the resume ordering. If I cut the batch size in half during resume, everything appears to be OK, except that GPU 0 still shows twice as much memory in use as the other GPUs.
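One way to see how much of that memory is actually tracked by the PyTorch allocator versus only visible to the driver is to print the allocator's own counters per device; a generic diagnostic sketch, not something from this thread (memory_reserved() was called memory_cached() on older PyTorch releases):

```python
import torch

# Compare what PyTorch's caching allocator accounts for against what nvidia-smi reports.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024 ** 3
    reserved = torch.cuda.memory_reserved(i) / 1024 ** 3
    print(f"cuda:{i}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")
```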
Upgrading to
@andravin I've tried monkeying around with this; I've never seen it quite as bad as you described (2x), but I can definitely reproduce it by resuming AMP-trained models that are close to the limit. I wonder if the number of distributed nodes has an impact? I believe I followed the recommendations in the Apex AMP repo correctly with respect to the sequence of model / optimizer / amp init, etc. I have no idea what is going on there and suspect (as with #80) it's beyond my control. I tried shuffling things around (going against the recommendations) and tried explicitly moving some optimizer state to specific GPUs. Sometimes it looked like an improvement, but then another run would crash... so I don't have a reliable workaround.
I just narrowed down the memory usage explosion on device 0 to the ModelEma resume. The memory manager never seems to recover from this. |
Any obvious to(device)/cuda mistakes in the EMA resume sequence? |
Something to try: I'm not mapping the device for the load here as I do for a normal checkpoint load, to either CPU or explicitly to the exact GPU: https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils.py#L262
It must be something like that: ModelEma._load_checkpoint is called 8 times, once for each device, yet the memory increase only happens on device 0.
|
Sure enough, this fixes it:
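Roughly, the change maps the EMA checkpoint load to CPU, the same way the normal resume path does. A sketch of the idea, with the checkpoint key and variable names as assumptions rather than the exact committed code:

```python
import torch

# Mapping the load to CPU keeps every distributed worker from deserializing
# the EMA weights onto cuda:0 before they are copied to the right place.
checkpoint = torch.load(checkpoint_path, map_location='cpu')  # checkpoint_path assumed
ema_model.load_state_dict(checkpoint.get('state_dict_ema', checkpoint))  # key name assumed
```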
|
What I want to know is: why wasn't the PyTorch memory manager able to collect this ~8GB of GPU memory after the EMA load finished? Also, the out-of-memory error, which did not occur until the first training step, seemed to indicate that there was ~8GB of used GPU memory that was unseen by the PyTorch memory manager.
I just noticed that @Andy1621 was also using 8 GPUs in the original bug report. Could not tell if he was also using EMA. That would be consistent with the hypothesis that this bug is more severe when there are more devices, which is exactly what we saw with the EMA weights resume before the fix. Since the fix, the issue is completely solved for me. I do not see extra GPU memory use on device #0 anymore, and resume never crashes. I suspect the patch also fixed the original issue reported by @Andy1621, but it would be good to get confirmation from him. |
@pichuang1984 was also using 8+ GPUs so curious if the change fixed it for him as well? |
Closing this since it's fixed for @andravin with 8 GPUs, and I'm not noticing the usual smaller memory spike on 2-GPU resumes anymore.
I am seeing the same error. I don't quite understand what you mean by loading the model before moving it to the GPU.
@aegonwolf this issue is ancient history, long since resolved. Are you sure you are in the right place? It might be best to create a new issue with details about the problem you are seeing and how to reproduce it, then reference this issue in the new report if it still seems relevant.
I have trained mobilenetv3_large_100 using 8 2080Ti GPUs with a batch size of 128 per GPU, which means 128 * 8 = 1024 images per batch. When I resumed the model, there was a "CUDA out of memory" error. However, when I trained it again from scratch, there wasn't any error.
I noticed that your code in "helper.py" loads the model on the CPU, which should be the solution for this bug, but why does this happen?
checkpoint = torch.load(checkpoint_path, map_location='cpu')
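For context, this is standard torch.load behaviour rather than anything specific to this repo: without map_location, tensors are restored onto the device they were saved from, so in a multi-process run every worker can briefly deserialize the whole checkpoint onto the saving GPU (often cuda:0). A small illustration, with a hypothetical file name:

```python
import torch

# Without map_location, each process that loads this file re-creates the saved
# tensors on the GPU they were saved from (often cuda:0) before anything else runs.
ckpt = torch.load('checkpoint.pth.tar')

# Mapping to CPU avoids that transient allocation; tensors get moved to the right
# device later, e.g. when load_state_dict copies them into a CUDA model.
ckpt = torch.load('checkpoint.pth.tar', map_location='cpu')
```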
Another interesting problem is that I find the acc@1 is very low in the first few epochs (near random accuracy), and the eval loss even rises. Why is that?
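(This is the warmup behaviour described in the first reply above.) A generic linear-warmup sketch, not the repo's actual scheduler, with all parameter names and values assumed:

```python
def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=3, warmup_lr_init=1e-4):
    # Ramp linearly from warmup_lr_init up to base_lr, then let the normal
    # schedule (cosine, step, ...) take over from base_lr.
    if epoch < warmup_epochs:
        return warmup_lr_init + (base_lr - warmup_lr_init) * epoch / warmup_epochs
    return base_lr

# While the LR is still tiny, top-1 accuracy stays near random and the eval loss
# can even rise a little until the LR reaches its normal value.
print([round(lr_at_epoch(e), 4) for e in range(5)])  # [0.0001, 0.0334, 0.0667, 0.1, 0.1]
```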