
CUDA out of memory for a single core A100 80G GPU #56

leondelee opened this issue Mar 16, 2023 · 11 comments




@leondelee

I encountered a CUDA OOM on a single A100 80GB GPU when running your training code. Can I fix this by changing anything?

@KurtFeynmanGodel

I can't say for certain, but the first thing you should try is setting the batch sizes to 1 and gradient accumulation to 1 as well. That is the configuration with the smallest memory footprint that needs no code changes. Start there.
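As a concrete illustration (a sketch only, assuming train.py exposes the standard HuggingFace TrainingArguments fields used in the README command quoted further down), those minimal-footprint settings correspond to:

from transformers import TrainingArguments

# Smallest memory footprint without code changes: batch size 1 per device,
# no gradient accumulation. Everything else stays at the README values.
minimal_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=True,  # as in the README command (requires an Ampere GPU such as the A100)
)

On the command line that is --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1.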

@yysjasmine

I encountered a CUDA OOM on a single A100 80GB GPU when running your training code. Can I fix this by changing anything?

Did you manage to fix the OOM problem? I ran into it as well, using Python 3.8 and PyTorch 1.13.1 on a single A100 80GB.

@JoelNiklaus

I have the same problem

@dlwh

dlwh commented Mar 24, 2023

Just to swoop in here: if you're using 6.7B or larger, 1 GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 * 4 bytes * 3 = 80.4GB just to store the parameters and optimizer states (even in mixed precision, you want to store those in fp32).

@JoelNiklaus

Thanks @dlwh. Switching to two A100 80GB GPUs worked for me.

@zhl5842

zhl5842 commented Mar 25, 2023

Just to swoop in here: if you're using 6.7B or larger, 1 GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 * 4 bytes * 3 = 80.4GB just to store the parameters and optimizer states (even in mixed precision, you want to store those in fp32).

why *3 ?

@dlwh

dlwh commented Mar 25, 2023

Adam, the default optimizer, stores two moment estimates that are each the same size as the parameters themselves, so you effectively keep three fp32 copies of the parameters, plus a fourth half-sized copy for the gradients (so, 14 bytes per parameter in total), and then memory for activations, which can range from "not very much" to many extra GB.
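Spelled out per parameter (a back-of-the-envelope sketch in Python; real usage also depends on activations, framework overhead, and allocator fragmentation):

# Rough memory estimate for full fine-tuning a 6.7B-parameter model with Adam,
# keeping master weights and optimizer states in fp32 as described above.
n_params = 6.7e9
bytes_per_param = (
    4      # fp32 master copy of the parameters
    + 4    # Adam first moment (exp_avg)
    + 4    # Adam second moment (exp_avg_sq)
    + 2    # half-precision gradients (the "fourth half-sized copy")
)          # = 14 bytes per parameter, before activations
print(n_params * bytes_per_param / 1e9)  # ~93.8 GB, already past one 80 GB A100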

2 A100s or ZeRO CPU offload should fix it right up.

@yysjasmine

Thanks a lot, 2+ GPUs worked fine.

@maziyarpanahi

maziyarpanahi commented Mar 30, 2023

What was the batch size for those who could make it work with 2 A100 80GB GPUs? I have 4 and it fails. The reason I am asking is that mine only works with batch_size set to 1 on 7B. However, the README says:

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode: (per_device_train_batch_size is 4 here, how did they manage this?)

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
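(My guess, though I am not sure it is the whole story: with --fsdp "full_shard auto_wrap", the parameters, gradients, and optimizer states are sharded across the 4 GPUs, so the fixed per-GPU cost is only a fraction of the ~94 GB estimated above, which would leave headroom for a per-device batch of 4. A rough, assumption-laden sketch in Python:)

# Back-of-the-envelope per-GPU memory under FSDP full_shard; real usage also
# includes activations, temporarily unsharded layers during forward/backward,
# and allocator overhead.
n_params = 6.7e9
bytes_per_param = 14   # fp32 params + Adam states + half-precision grads, as above
n_gpus = 4
print(n_params * bytes_per_param / n_gpus / 1e9)  # ~23 GB per GPU, well under 80 GB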

@JoelNiklaus

For me it also worked with batch size 4 on 2 80GB A100 GPUs for sequence length 512.

@maziyarpanahi

That's really strange; the default value for model_max_length is already 512. With batch_size 1 it unfortunately takes forever.
