
CUDA out of memory for a single core A100 80G GPU #56

leondelee opened this issue Mar 16, 2023 · 11 comments




@leondelee

I encountered a CUDA OOM on a single A100 80GB GPU when running your training code. Can I fix this by changing anything?

@KurtFeynmanGodel

I can't say for certain, but the first thing you should try is setting the batch sizes to 1 and gradient accumulation to 1 as well. That is the configuration with the smallest memory footprint that needs no code changes. Start there.
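As a concrete illustration (a sketch only, assuming train.py exposes the standard HuggingFace TrainingArguments fields used in the README command quoted further down), those minimal-footprint settings correspond to:

from transformers import TrainingArguments

# Smallest memory footprint without code changes: batch size 1 per device,
# no gradient accumulation. Everything else stays at the README values.
minimal_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=True,  # as in the README command (requires an Ampere GPU such as the A100)
)

On the command line that is --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1.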

@yysjasmine

I encountered a CUDA OOM on a single A100 80GB GPU when running your training code. Can I fix this by changing anything?

Did you manage to fix the OOM problem? I ran into it as well, using Python 3.8 and PyTorch 1.13.1 on a single A100 80GB.

@JoelNiklaus

I have the same problem

@dlwh

dlwh commented Mar 24, 2023

Just to swoop in here: if you're using 6.7B or larger, 1 GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 * 4 bytes * 3 = 80.4GB just to store the parameters and optimizer states (even in mixed precision, you want to store those in fp32).

@JoelNiklaus

Thanks @dlwh. Switching to two A100 80GB GPUs worked for me.

@zhl5842

zhl5842 commented Mar 25, 2023

Just to swoop in here: if you're using 6.7B or larger, 1 GPU (even an A100) isn't going to be enough without DeepSpeed ZeRO CPU offload or ZeRO-Infinity. At a minimum, to fine-tune you need 6.7e9 * 4 bytes * 3 = 80.4GB just to store the parameters and optimizer states (even in mixed precision, you want to store those in fp32).

why *3 ?

@dlwh

dlwh commented Mar 25, 2023

Adam, the default optimizer, stores two moment estimates that are each the same size as the parameters themselves, so you effectively keep three fp32 copies of the parameters, plus a fourth half-sized copy for the gradients (so, 14 bytes per parameter in total), and then memory for activations, which can range from "not very much" to many extra GB.
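Spelled out per parameter (a back-of-the-envelope sketch in Python; real usage also depends on activations, framework overhead, and allocator fragmentation):

# Rough memory estimate for full fine-tuning a 6.7B-parameter model with Adam,
# keeping master weights and optimizer states in fp32 as described above.
n_params = 6.7e9
bytes_per_param = (
    4      # fp32 master copy of the parameters
    + 4    # Adam first moment (exp_avg)
    + 4    # Adam second moment (exp_avg_sq)
    + 2    # half-precision gradients (the "fourth half-sized copy")
)          # = 14 bytes per parameter, before activations
print(n_params * bytes_per_param / 1e9)  # ~93.8 GB, already past one 80 GB A100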

2 A100s or ZeRO CPU offload should fix it right up.

@yysjasmine

Thanks a lot, 2+ GPUs worked fine.

@maziyarpanahi

maziyarpanahi commented Mar 30, 2023

What was the batch size for those who could make it work with 2 A100 80GB GPUs? I have 4 and it fails. The reason I am asking is that mine only works with batch_size set to 1 on 7B. However, the README says:

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode: (per_device_train_batch_size is 4 here, how did they manage this?)

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True
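(My guess, though I am not sure it is the whole story: with --fsdp "full_shard auto_wrap", the parameters, gradients, and optimizer states are sharded across the 4 GPUs, so the fixed per-GPU cost is only a fraction of the ~94 GB estimated above, which would leave headroom for a per-device batch of 4. A rough, assumption-laden sketch in Python:)

# Back-of-the-envelope per-GPU memory under FSDP full_shard; real usage also
# includes activations, temporarily unsharded layers during forward/backward,
# and allocator overhead.
n_params = 6.7e9
bytes_per_param = 14   # fp32 params + Adam states + half-precision grads, as above
n_gpus = 4
print(n_params * bytes_per_param / n_gpus / 1e9)  # ~23 GB per GPU, well under 80 GB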

@JoelNiklaus

For me it also worked with batch size 4 on 2 80GB A100 GPUs for sequence length 512.

@maziyarpanahi

That's really strange; the default value for model_max_length is already 512. With batch_size 1 it unfortunately takes forever.
