CUDA out of memory for a single core A100 80G GPU #56
I can't say for certain. But the first thing you should try is setting the batch size to 1 and the gradient accumulation steps to 1 as well. That is the configuration with the minimal memory footprint that requires no code changes. Start there.
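To make the advice above concrete, here is a small sketch of how the batch-size knobs combine into one optimizer step. The parameter names mirror the HuggingFace `TrainingArguments` options (`per_device_train_batch_size`, `gradient_accumulation_steps`); whether this repo's training script uses exactly those flags is an assumption.

```python
# Sketch: how per-device batch size and gradient accumulation combine.
# Flag names are modeled on HuggingFace TrainingArguments (an assumption
# about this repo's script); the arithmetic itself is standard.

def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Number of samples contributing to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Minimal-memory configuration suggested above: everything set to 1.
print(effective_batch_size(per_device_batch=1, grad_accum_steps=1, num_gpus=1))  # -> 1
```

Note that lowering these only shrinks activation memory; parameter, gradient, and optimizer-state memory stay the same, which is why batch size 1 alone may still OOM.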
Can you fix the OOM problem? I ran into that problem as well with Python 3.8 and PyTorch 1.13.1, on a single A100 80GB GPU.
I have the same problem.
Just to swoop in here: if you're using 6.7B or larger, 1 GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU Offload or ZeRO-Infinity. At a minimum, to fine-tune you need on the order of (number of parameters) × 4 bytes × 3, before gradients and activations.
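For the ZeRO CPU offload route mentioned above, here is a sketch of the kind of DeepSpeed config file you could pass to the launcher (e.g. via `--deepspeed ds_config.json` with the HuggingFace Trainer integration). The `"auto"` values let that integration fill in settings from the training arguments; the specific values chosen here are assumptions, not this repo's official config.

```python
import json

# Sketch of a DeepSpeed ZeRO-3 config with CPU offload. Key names follow
# the DeepSpeed config schema; the particular choices are assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer state
        "offload_optimizer": {"device": "cpu"},  # move Adam state into CPU RAM
        "offload_param": {"device": "cpu"},      # offload params too (ZeRO-Infinity style)
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Offloading the optimizer state alone removes the two Adam moment copies from GPU memory, which is the bulk of the static footprint.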
Thanks @dlwh. When switching to two A100 80GB GPUs it worked for me.
Why ×3?
Adam, the default optimizer, stores two momentum terms that are each the same size as the parameters themselves, so you have essentially three full copies of the parameters, plus a fourth half-sized copy for the gradients (14 bytes per parameter in total), and then activation memory that ranges from "not very much" to many extra GB. 2 A100s or ZeRO CPU offload should fix it right up.
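The breakdown above can be checked with back-of-the-envelope arithmetic: three 4-byte fp32 copies per parameter (weights plus two Adam moments) and 2-byte gradients give 14 bytes per parameter before activations, which for a 6.7B-parameter model already exceeds a single 80 GB card.

```python
# Static training footprint implied by the comment above:
# 3 fp32 copies (params + two Adam moments) at 4 bytes each, plus
# fp16 gradients at 2 bytes = 14 bytes per parameter, excluding activations.

def static_footprint_gib(num_params: float, bytes_per_param: int = 14) -> float:
    """Parameter + gradient + optimizer-state memory in GiB."""
    return num_params * bytes_per_param / 2**30

# For a 6.7B-parameter model this lands around 87 GiB, already over an
# 80 GB A100 before any activation memory is counted.
print(round(static_footprint_gib(6.7e9), 1))
```

This is why the thread's fixes work: a second GPU splits the footprint, and ZeRO CPU offload moves the 8 bytes/param of Adam state off the GPU entirely.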
Thanks a lot, 2+ GPUs worked fine.
What batch size did those of you who made it work with 2 A100 80GB GPUs use? I have 4 and it still fails. The reason I am asking is that mine only works with batch_size set to 1 on the 7B model, which doesn't match what the README suggests.
For me it also worked with batch size 4 on 2 80GB A100 GPUs for sequence length 512. |
That's really strange; the default value for model_max_length is already 512. With batch_size 1 it takes forever, unfortunately.
I encountered a CUDA OOM on a single A100 80GB GPU using your training code. Can I fix this by changing anything?