
Fine-Tuning very slow (6h->24h??) #32

Closed
chavinlo opened this issue Mar 15, 2023 · 49 comments

Comments

@chavinlo

Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.

I am running the fine-tuning script on a 4xA100-SXM4-80GB setup and currently getting a 24h ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?

Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?

Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.

Thanks.

@joaopandolfi

Can you share the result file when you finish?

@chavinlo
Author

> Can you share the result file when you finish?

Sure, but there's a LoRA repo that supposedly gives better results than the current one, not sure...
Here they are: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora

@lxuechen
Collaborator

lxuechen commented Mar 15, 2023

> Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.
>
> I am running the fine-tuning script on a 4xA100-SXM4-80GB setup and currently getting a 24h ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?
>
> Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
>
> Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
>
> Thanks.

Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command in a training run that lasted less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?

@chavinlo
Author

> > Hello, first of all thank you for releasing the training code for Alpaca, we really appreciate it.
> > I am running the fine-tuning script on a 4xA100-SXM4-80GB setup and currently getting a 24h ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned on https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?
> > Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
> > Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
> > Thanks.
>
> Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command in a training run that lasted less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?

Power usage is a bit low, but the rest is at max:
[screenshot: GPU utilization stats]

The full stats are available on wandb: https://wandb.ai/peruano/huggingface/runs/68m25500?workspace=user-peruano

Currently it's going at 64.81 s/it after a reboot.
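
(For anyone checking the same thing: a quick way to watch per-GPU utilization and power draw during a run is nvidia-smi's monitoring mode; this is standard nvidia-smi usage, not something specific to this repo.)

# Print power draw, GPU/memory utilization, and framebuffer usage once per second
nvidia-smi dmon -s pum -d 1

# Or simply poll the summary view
watch -n 1 nvidia-smi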

@charliezjw

I keep getting this error message; I am wondering whether you have seen it:
Exception: Could not find the transformer layer class to wrap in the model.

Thank you!

@Tiiiger
Collaborator

Tiiiger commented Mar 15, 2023

Hi @chavinlo,

Are you running the released code? Best to adapt from there.

Thanks.

And @charliezjw,

There are not enough details for me to respond to. What code are you running?

@chavinlo
Author

> Are you running the released code? Best to adapt from there.

Yes, I am running the fine-tuning code from this repo.

@devilismyfriend

devilismyfriend commented Mar 15, 2023

> > Can you share the result file when you finish?
>
> Sure, but there's a LoRA repo that supposedly gives better results than the current one, not sure... Here they are: https://huggingface.co/chavinlo/alpaca https://github.com/tloen/alpaca-lora

I tried the LoRA one (from the repo, not yours) and found it to be worse, getting a lot of "noinput" vs the Stanford one, so let us know if you get it trained :D

@melisa-writer

> https://github.com/tloen/alpaca-lora

> I keep getting this error message; I am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.
>
> Thank you!

Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.

pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

@charliezjw

I solved the error by replacing every "LLaMA" with "Llama".

> > https://github.com/tloen/alpaca-lora
>
> > I keep getting this error message; I am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.
> > Thank you!
>
> Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.
>
> pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

@XuhuiRen

> > https://github.com/tloen/alpaca-lora
>
> > I keep getting this error message; I am wondering whether you have seen it: Exception: Could not find the transformer layer class to wrap in the model.
> > Thank you!
>
> Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.
>
> pip install git+https://github.com/zphang/transformers.git@68d640f7c368bcaaaecfc678f11908ebbd3d6176

Try changing the parameter "fsdp_transformer_layer_cls_to_wrap" to "LlamaDecoderLayer"; that should solve this issue.
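
(For context, that parameter is a flag on the training command line. A minimal hedged fragment is below; the paths and node count are placeholders to adapt to your own setup, and any flags not shown keep their defaults.)

torchrun --nproc_per_node=4 train.py \
    --model_name_or_path ./converted_llama_7B \
    --data_path ./alpaca_data.json \
    --output_dir ./trained_model \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'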

@chavinlo
Author

@joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; they get saved every 200 steps, 1200 total.

Here's the 200-step checkpoint (aka 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main

@0xbitches

0xbitches commented Mar 16, 2023

@chavinlo thank you for your work! Are you able to train the LoRA on 13b (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.

@chavinlo
Author

> @chavinlo thank you for your work! Are you able to train the LoRA on 13b (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.

Yes, but only once I get my A100s fixed, because there definitely is something throttling them.

@0xbitches

0xbitches commented Mar 16, 2023

> Yes, but only once I get my A100s fixed, because there definitely is something throttling them.

Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol

@devilismyfriend

devilismyfriend commented Mar 16, 2023

> > Yes, but only once I get my A100s fixed, because there definitely is something throttling them.
>
> Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol

I encountered the same issue as you. I used the inference kwargs from this repo instead of what he has there and it's miles better; I still think it's a bit worse than this one, but give it a try.

@devilismyfriend

> @joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; they get saved every 200 steps, 1200 total.
>
> Here's the 200-step checkpoint (aka 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main

Thanks! Can't wait to test the final checkpoint :)

@0xbitches

0xbitches commented Mar 16, 2023

> inference kwargs from this repo

Could you specify what you meant here? Did you use the alpaca code and train your own model?

@cxj01

cxj01 commented Mar 16, 2023

@chavinlo

> Hi @chavinlo,
>
> Are you running the released code? Best to adapt from there.
>
> Thanks.
>
> And @charliezjw,
>
> There are not enough details for me to respond to. What code are you running?

I have the same problem. Have you solved it?

@chavinlo
Author

> @chavinlo
>
> > Hi @chavinlo,
> > Are you running the released code? Best to adapt from there.
> > Thanks.
> > And @charliezjw,
> > There are not enough details for me to respond to. What code are you running?
>
> I have the same problem. Have you solved it?

????????????????????????

@cxj01

cxj01 commented Mar 16, 2023

@chavinlo
I have the same problem. Have you solved it?
Exception: Could not find the transformer layer class to wrap in the model.

@lxuechen
Collaborator

The reason for some of these issues is explained in this note.

Feel free to reopen if it doesn't fully resolve the mysteries :)

@chavinlo
Author

> The reason for some of these issues is explained in this note.
>
> Feel free to reopen if it doesn't fully resolve the mysteries :)

My issue is about speed... nothing related to these layer-class errors; they just came into my issue for some reason.

@devilismyfriend

> > inference kwargs from this repo
>
> Could you specify what you meant here? Did you use the alpaca code and train your own model?

Search for the inference kwargs issue here; use the parameters from the generation config in the LoRA repo.

@chavinlo
Author

@lxuechen Can you reopen this issue? The original problem was about speed, not about layers.
Additionally, I have tried cleaning the instance, and I still get the same speed.

@lxuechen lxuechen reopened this Mar 16, 2023
@chavinlo
Author

report.txt
Here is the nvidia-smi -q report of my GPUs...

@helloeve

@chavinlo I am in the same boat: ~24 hour ETA with 4x 80GB A100. My GPU power usage is also low (~80W/300W), which seems suspicious.

@puyuanliu

My training finishes in 4 hours on 8x A100 (40GB) using fp16 (3 epochs).

Environment:
PyTorch 1.13
transformers (latest one on git)

Command:

torchrun --nproc_per_node=8 --master_port=1234 train.py \
    --model_name_or_path converted_llama_7B \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./trained_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

However, once the training finishes, the model generates nonsense tokens, as mentioned in #70 and #51.

@chavinlo
Author

chavinlo commented Mar 17, 2023

> @chavinlo I am in the same boat: ~24 hour ETA with 4x 80GB A100. My GPU power usage is also low (~80W/300W), which seems suspicious.

I think it might have to do with something in the OS configuration. I've changed GPUs and they still perform the same; at one point it even went 2.5 times slower, giving an ETA of 52h.

However, someone told me that runpod works alright. Here are his graphs:
[screenshot: GPU power usage graphs]
You can see a 20-30% power usage increase from what I got: https://wandb.ai/peruano/huggingface/runs/jbeh9a6r?workspace=user-peruano

@447428054

@charliezjw @puyuanliu
Excuse me, is the following installation method correct?

pip install git+https://github.com/zphang/transformers.git@llama_push

Each version is as follows:
numpy==1.24.2
rouge-score==0.1.2
fire==0.5.0
openai==0.27.2
sentencepiece==0.1.97
wandb==0.14.0

@puyuanliu

@447428054 You can actually pull directly from the transformers git repo. They merged zphang's pull request 12 hours ago.
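
(In other words, something like the following should now work, installing transformers straight from its main branch; whether you want to pin an exact commit instead is up to you.)

pip install git+https://github.com/huggingface/transformers.git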

@helloeve

@chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4x A100 80GB simply by setting --fsdp "shard_grad_op auto_wrap". I don't have the full story of why this is the case, but it seems to be working well. Hope this can help resolve your case as well.
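
(For anyone following along, this is just the --fsdp value swapped inside the same kind of torchrun invocation shown earlier in the thread. The paths below are placeholders; keep whatever other flags you already use.)

# shard_grad_op shards only gradients and optimizer state, which reduces the
# communication load compared to full_shard at the cost of slightly more VRAM.
torchrun --nproc_per_node=4 --master_port=1234 train.py \
    --model_name_or_path ./converted_llama_7B \
    --data_path ./alpaca_data.json \
    --output_dir ./trained_model \
    --fsdp "shard_grad_op auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'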

@puyuanliu

@helloeve Do you see any CUDA OOM errors while training? And is your inference output reasonable?

@chavinlo
Author

> @chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4x A100 80GB simply by setting --fsdp "shard_grad_op auto_wrap". I don't have the full story of why this is the case, but it seems to be working well. Hope this can help resolve your case as well.

God bless you... this works perfectly; the ETA is now 5:30... 16 s/it.

I believe this issue can be closed then? @lxuechen Could this be added to the documentation, just in case?

@helloeve

> @helloeve Do you see any CUDA OOM errors while training? And is your inference output reasonable?

I don't notice any OOM errors. The inference output is yet to be seen after the training, but I guess it would be better to discuss that in the other issue(s) that have been created for this topic?

@helloeve

@chavinlo I made a tiny PR to add a note in the README - #79. Hopefully this can help others avoid struggling with reproducing the fine-tuning speed.

@puyuanliu

@helloeve Thanks for the reply! I will put it under other topics.

@helloeve

helloeve commented Mar 17, 2023

@puyuanliu If you don't mind sharing your code to reproduce the problematic prediction output, I can give it a try once my training is done.

@puyuanliu

@helloeve Thanks a lot! I was using 8x A100 (40GB) and the error I got is mentioned in #65. I still get a saved model of ~26 GB after this error, but it seems like the model weights are corrupted. This is also reported in #51.

@helloeve

> @helloeve Thanks a lot! I was using 8x A100 (40GB) and the error I got is mentioned in #65. I still get a saved model of ~26 GB after this error, but it seems like the model weights are corrupted. This is also reported in #51.

Will reply in #51

@puyuanliu

@helloeve Thanks again!

@chavinlo
Author

> @helloeve Thanks a lot! I was using 8x A100 (40GB) and the error I got is mentioned in #65. I still get a saved model of ~26 GB after this error, but it seems like the model weights are corrupted. This is also reported in #51.

I also got OOM with bs=4; bs=3 seems to work fine so far, with a slight ETA increase of ~30 min.

@lxuechen
Collaborator

> @chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4x A100 80GB simply by setting --fsdp "shard_grad_op auto_wrap". I don't have the full story of why this is the case, but it seems to be working well. Hope this can help resolve your case as well.

FSDP full_shard saves memory at the cost of communication: each wrapped block's weights are sharded and distributed evenly across all your devices, so every forward and backward pass through a single wrapped block needs to call all-gather and scatter.

The issue you encountered is likely GPU <-> GPU communication (the precise reason I don't fully understand). shard_grad_op reduces the communication load but consumes slightly more VRAM.

We will send in a PR to document this.

@MarkSchmidty

@chavinlo Your 7B Native is the best Alpaca Finetune available.

Lots of people are excited to try your 13B Native finetune. Can you re-upload it to HF?

@chavinlo
Author

> @chavinlo Your 7B Native is the best Alpaca Finetune available.
>
> Lots of people are excited to try your 13B Native finetune. Can you re-upload it to HF?

I did one but I deleted it by accident lololol
Gonna train again later today

@MarkSchmidty

> I did one but I deleted it by accident lololol

Oops!

Looking forward to 30B as well.

@chavinlo
Author

Additionally, the reason for the slowdown was a misconfiguration in the cross-GPU communication: it was using a PCIe host bridge (PHB) rather than NVLink. I verified this via nvidia-smi topo -m. I fixed it by contacting my GPU provider, as this was an error on their end.

While the solution above works, it probably won't work for larger models (?)
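
(For anyone who wants to check for the same misconfiguration: in the nvidia-smi topology matrix, NV# entries between GPU pairs indicate NVLink, while PHB means the traffic crosses a PCIe host bridge via the CPU. This is generic nvidia-smi/NCCL usage, not specific to this repo.)

# Show the GPU interconnect topology matrix (look for NV# vs PHB between GPUs)
nvidia-smi topo -m

# Optionally have NCCL log which transport it selects during training
export NCCL_DEBUG=INFO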

@Wingie

Wingie commented Mar 31, 2023

Where do you rent your GPUs? Is it a spot instance? LambdaLabs never has 8xA100s available for me :( Also, does that mean that PCIe GPU topologies will be very slow for 13B and should be avoided?

@chavinlo
Author

> Where do you rent your GPUs? Is it a spot instance? LambdaLabs never has 8xA100s available for me :( Also, does that mean that PCIe GPU topologies will be very slow for 13B and should be avoided?

I think it works for 13B; not sure about other models.
