
Fine-Tuning very slow (6h->24h??) #32

Closed
chavinlo opened this issue Mar 15, 2023 · 49 comments

Comments

@chavinlo

Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.

I am running the fine-tuning script on a 4xA100-SXM4-80GB setup and currently getting a 24h ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?

Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?

Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.

Thanks.

@joaopandolfi

Can you share the result file when you finish?

@chavinlo
Author

> Can you share the result file when you finish?

Sure, but there's a LoRA repo that supposedly gives better results than the current one, not sure...
Here they are: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora

@lxuechen
Collaborator

lxuechen commented Mar 15, 2023

> Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.
>
> I am running the fine-tuning script on a 4xA100-SXM4-80GB setup and currently getting a 24h ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?
>
> Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
>
> Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
>
> Thanks.

Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command in a training run that lasted less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?

@chavinlo
Author

> > Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.
> > I am running the fine-tuning script on a 4xA100-SXM4-80GB setup and currently getting a 24h ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?
> > Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
> > Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
> > Thanks.
>
> Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command in a training run that lasted less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?

Power usage is a bit low, but the rest is at max:
[screenshot: GPU utilization stats]

The full stats are available on wandb: https://wandb.ai/peruano/huggingface/runs/68m25500?workspace=user-peruano

Currently it's going at 64.81 s/it after a reboot.
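
(For anyone checking the same thing: a quick way to watch per-GPU utilization and power draw during a run is nvidia-smi's monitoring mode; this is standard nvidia-smi usage, not something specific to this repo.)

# Print power draw, GPU/memory utilization, and framebuffer usage once per second
nvidia-smi dmon -s pum -d 1

# Or simply poll the summary view
watch -n 1 nvidia-smi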

@charliezjw

I keep getting this error message; I am wondering whether you have seen it:
Exception: Could not find the transformer layer class to wrap in the model.

Thank you!

@Tiiiger
Collaborator

Tiiiger commented Mar 15, 2023

Hi @chavinlo,

Are you running the released code? Best to adapt from there.

Thanks.

And @charliezjw,

There are not enough details for me to respond to. What code are you running?

@chavinlo
Author

> Are you running the released code? Best to adapt from there.

Yes, I am running the fine-tuning code from this repo.

@devilismyfriend

devilismyfriend commented Mar 15, 2023

> > Can you share the result file when you finish?
>
> Sure, but there's a LoRA repo that supposedly gives better results than the current one, not sure... Here they are: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora

I tried the LoRA one (from the repo, not yours) and found it to be worse, getting a lot of "noinput" vs the Stanford one, so let us know if you get it trained :D

@melisa-writer

> https://github.com/tloen/alpaca-lora

> I keep getting this error message; I am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.
>
> Thank you!

Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.

pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

@charliezjw

I solved the error by replacing every "LLaMA" with "Llama".

> > https://github.com/tloen/alpaca-lora
>
> > I keep getting this error message; I am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.
> > Thank you!
>
> Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.
>
> pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

@XuhuiRen

> > https://github.com/tloen/alpaca-lora
>
> > I keep getting this error message; I am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.
> > Thank you!
>
> Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.
>
> pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

Try changing the parameter "fsdp_transformer_layer_cls_to_wrap" to "LlamaDecoderLayer"; that should solve this issue.
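
(For context, that parameter is a flag on the training command line. A minimal hedged fragment is below; the paths and node count are placeholders to adapt to your own setup, and any flags not shown keep their defaults.)

torchrun --nproc_per_node=4 train.py \
    --model_name_or_path ./converted_llama_7B \
    --data_path ./alpaca_data.json \
    --output_dir ./trained_model \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'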

@chavinlo
Author

@joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; they get saved every 200 steps, 1200 total.

Here's the 200-step checkpoint (aka 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main

@0xbitches

0xbitches commented Mar 16, 2023

@chavinlo thank you for your work! Are you able to train the LoRA on 13b (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.

@chavinlo
Author

> @chavinlo thank you for your work! Are you able to train the LoRA on 13b (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.

Yes, but only once I get my A100s fixed, because there definitely is something throttling them.

@0xbitches

0xbitches commented Mar 16, 2023

> Yes, but only once I get my A100s fixed, because there definitely is something throttling them.

Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol

@devilismyfriend

devilismyfriend commented Mar 16, 2023

> > Yes, but only once I get my A100s fixed, because there definitely is something throttling them.
>
> Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol

I encountered the same issue as you. I used the inference kwargs from this repo instead of what he has there and it's miles better; I still think it's a bit worse than this one, but give it a try.

@devilismyfriend

> @joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; they get saved every 200 steps, 1200 total.
>
> Here's the 200-step checkpoint (aka 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main

Thanks! Can't wait to test the final checkpoint :)

@0xbitches

0xbitches commented Mar 16, 2023

> inference kwargs from this repo

Could you specify what you meant here? Did you use the alpaca code and train your own model?

@cxj01

cxj01 commented Mar 16, 2023

@chavinlo

> Hi @chavinlo,
>
> Are you running the released code? Best to adapt from there.
>
> Thanks.
>
> And @charliezjw,
>
> There are not enough details for me to respond to. What code are you running?

I have the same problem. Have you solved it?

@chavinlo
Author

> @chavinlo
>
> > Hi @chavinlo,
> > Are you running the released code? Best to adapt from there.
> > Thanks.
> > And @charliezjw,
> > There are not enough details for me to respond to. What code are you running?
>
> I have the same problem. Have you solved it?

????????????????????????

@cxj01

cxj01 commented Mar 16, 2023

@chavinlo
I have the same problem. Have you solved it?
Exception: Could not find the transformer layer class to wrap in the model.

@lxuechen
Collaborator

The reason for some of these issues is explained in this note.

Feel free to reopen if it doesn't fully resolve the mysteries :)

@chavinlo
Author

> The reason for some of these issues is explained in this note.
>
> Feel free to reopen if it doesn't fully resolve the mysteries :)

My issue is about speed... nothing related to these layer-class errors; they just came into my issue for some reason.

@devilismyfriend

> > inference kwargs from this repo
>
> Could you specify what you meant here? Did you use the alpaca code and train your own model?

Search for the inference kwargs issue here; use the parameters from the generation config in the LoRA repo.

@chavinlo
Author

@lxuechen Can you reopen this issue? The original problem was about speed, not about layers.
Additionally, I have tried cleaning the instance, and I still get the same speed.

@lxuechen lxuechen reopened this Mar 16, 2023
@chavinlo
Author

report.txt
Here is the nvidia-smi -q report of my GPUs...

@helloeve

@chavinlo I am in the same boat: ~24 hour ETA with 4x 80GB A100. My GPU power usage is also low (~80W/300W), which seems suspicious.

@puyuanliu

My training finishes in 4 hours on 8x A100 (40GB) using fp16 (3 epochs).

Environment:
PyTorch 1.13
transformers (latest one on git)

Command:

torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

However, once the training finishes, the model generates nonsense tokens, as mentioned in #70 and #51.

@chavinlo
Author

chavinlo commented Mar 17, 2023

> @chavinlo I am in the same boat: ~24 hour ETA with 4x 80GB A100. My GPU power usage is also low (~80W/300W), which seems suspicious.

I think it might have to do with something in the OS configuration. I've changed GPUs and they still perform the same; at one point it even went 2.5 times slower, giving an ETA of 52h.

However, someone told me that runpod works alright. Here are his graphs:
[screenshot: GPU power usage graphs]
You can see a 20-30% power usage increase from what I got: https://wandb.ai/peruano/huggingface/runs/jbeh9a6r?workspace=user-peruano

@447428054

@charliezjw @puyuanliu
Excuse me, is the following installation method correct?

pip install git+https://github.com/zphang/transformers.git@llama_push

Each version is as follows:
numpy==1.24.2
rouge-score==0.1.2
fire==0.5.0
openai==0.27.2
sentencepiece==0.1.97
wandb==0.14.0

@puyuanliu

@447428054 You can actually pull directly from the transformers git repo. They merged zphang's pull request 12 hours ago.
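
(In other words, something like the following should now work, installing transformers straight from its main branch; whether you want to pin an exact commit instead is up to you.)

pip install git+https://github.com/huggingface/transformers.git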

@helloeve

@chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4x A100 80GB simply by setting --fsdp "shard_grad_op auto_wrap". I don't have the full story of why this is the case, but it seems to be working well. Hope this can help resolve your case as well.
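
(For anyone following along, this is just the --fsdp value swapped inside the same kind of torchrun invocation shown earlier in the thread. The paths below are placeholders; keep whatever other flags you already use.)

# shard_grad_op shards only gradients and optimizer state, which reduces the
# communication load compared to full_shard at the cost of slightly more VRAM.
torchrun --nproc_per_node=4 --master_port=1234 train.py \
    --model_name_or_path ./converted_llama_7B \
    --data_path ./alpaca_data.json \
    --output_dir ./trained_model \
    --fsdp "shard_grad_op auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'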

@puyuanliu

@helloeve Do you see any CUDA OOM errors while training? And is your inference output reasonable?

@chavinlo
Author

> @chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4x A100 80GB simply by setting --fsdp "shard_grad_op auto_wrap". I don't have the full story of why this is the case, but it seems to be working well. Hope this can help resolve your case as well.

God bless you... this works perfectly; the ETA is now 5:30... 16 s/it.

I believe this issue can be closed then? @lxuechen Could this be added to the documentation, just in case?

@helloeve

> @helloeve Do you see any CUDA OOM errors while training? And is your inference output reasonable?

I don't notice any OOM errors. The inference output is yet to be seen after the training, but I guess it would be better to discuss that in the other issue(s) that have been created for this topic?

@helloeve

@chavinlo I made a tiny PR to add a note in the README - #79. Hopefully this can help others avoid struggling with reproducing the fine-tuning speed.

@puyuanliu

@helloeve Thanks for the reply! I will put it under other topics.

@helloeve

helloeve commented Mar 17, 2023

@puyuanliu If you don't mind sharing your code to reproduce the problematic prediction output, I can give it a try once my training is done.

@puyuanliu

@helloeve Thanks a lot! I was using 8x A100 (40GB) and the error I got is mentioned in #65. I still get a saved model of ~26 GB after this error, but it seems like the model weights are corrupted. This is also reported in #51.

@helloeve

> @helloeve Thanks a lot! I was using 8x A100 (40GB) and the error I got is mentioned in #65. I still get a saved model of ~26 GB after this error, but it seems like the model weights are corrupted. This is also reported in #51.

Will reply in #51

@puyuanliu

@helloeve Thanks again!

@chavinlo
Author

> @helloeve Thanks a lot! I was using 8x A100 (40GB) and the error I got is mentioned in #65. I still get a saved model of ~26 GB after this error, but it seems like the model weights are corrupted. This is also reported in #51.

I also got OOM with bs=4; bs=3 seems to work fine so far, with a slight ETA increase of ~30 min.

@lxuechen
Collaborator

> @chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4x A100 80GB simply by setting --fsdp "shard_grad_op auto_wrap". I don't have the full story of why this is the case, but it seems to be working well. Hope this can help resolve your case as well.

FSDP full_shard saves memory at the cost of communication: each wrapped block's weights are sharded and distributed evenly across all your devices, so every forward and backward pass through a single wrapped block needs to call all-gather and scatter.

The issue you encountered is likely GPU <-> GPU communication (the precise reason I don't fully understand). shard_grad_op reduces the communication load but consumes slightly more VRAM.

We will send in a PR to document this.

@MarkSchmidty

@chavinlo Your 7B Native is the best Alpaca Finetune available.

Lots of people are excited to try your 13B Native finetune. Can you re-upload it to HF?

@chavinlo
Author

> @chavinlo Your 7B Native is the best Alpaca Finetune available.
>
> Lots of people are excited to try your 13B Native finetune. Can you re-upload it to HF?

I did one but I deleted it by accident lololol
Gonna train again later today

@MarkSchmidty

> I did one but I deleted it by accident lololol

Oops!

Looking forward to 30B as well.

@chavinlo
Author

Additionally, the reason for the slowdown was a misconfiguration in the cross-GPU communication: it was using a PCIe host bridge (PHB) rather than NVLink. I verified this via nvidia-smi topo -m. I fixed it by contacting my GPU provider, as this was an error on their end.

While the solution above works, it probably won't work for larger models (?)
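
(For anyone who wants to check for the same misconfiguration: in the nvidia-smi topology matrix, NV# entries between GPU pairs indicate NVLink, while PHB means the traffic crosses a PCIe host bridge via the CPU. This is generic nvidia-smi/NCCL usage, not specific to this repo.)

# Show the GPU interconnect topology matrix (look for NV# vs PHB between GPUs)
nvidia-smi topo -m

# Optionally have NCCL log which transport it selects during training
export NCCL_DEBUG=INFO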

@Wingie

Wingie commented Mar 31, 2023

Where do you rent your GPUs? Is it a spot instance? LambdaLabs never has 8xA100s available for me :( Also, does that mean that PCIe GPU topologies will be very slow for 13B and should be avoided?

@chavinlo
Author

> Where do you rent your GPUs? Is it a spot instance? LambdaLabs never has 8xA100s available for me :( Also, does that mean that PCIe GPU topologies will be very slow for 13B and should be avoided?

I think it works for 13B; not sure about other models.
