Fine-Tuning very slow (6h->24h??) #32
Comments
Can you share the result file when you finish?
Sure, but there's a LoRA repo that supposedly gives better results than the current one; not sure...
Thanks for your interest! That should not happen, in principle. I reproduced our initial model with the given training code and command in a training run that lasted less than 2 hours. Could you share more details of your setup? Are you getting reasonable GPU utilization?
Power usage is a bit low, but the rest is at max; the full stats are available on wandb: https://wandb.ai/peruano/huggingface/runs/68m25500?workspace=user-peruano Currently it's going at 64.81 s/it after a reboot.
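As an aside, one way to sample GPU utilization and power draw outside of wandb is through NVML; below is a minimal sketch, assuming the `pynvml` package is installed (not something used in this thread). Low SM utilization together with low power draw usually points at a communication or data-loading bottleneck rather than compute.

```python
# Sketch: query per-GPU utilization and power via NVML (requires `pip install pynvml`).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu      # percent
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0      # milliwatts -> watts
    print(f"GPU {i}: utilization={util}%  power={power:.0f} W")
pynvml.nvmlShutdown()
```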
I keep getting this error message; I am wondering whether you have seen it. Thank you!
Hi @chavinlo, are you running the released code? It's best to adapt from there. Thanks. And @charliezjw, there are not enough details for me to respond to. What code are you running?
Yes, I am running the fine-tuning code from this repo.
I tried the LoRA one (from the repo, not yours) and found it to be worse, getting a lot of "noinput" compared to the Stanford one, so let us know if you get it trained :D
Did you install the commit mentioned in the README? I got this error when I installed the current version of the HF PR.
I solved the error by replacing every "LLaMA" with "Llama".
Try setting the parameter "fsdp_transformer_layer_cls_to_wrap" to "LlamaDecoderLayer"; that should solve this issue.
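For reference, a minimal sketch of the class names after the upstream "LLaMA" -> "Llama" rename, assuming a transformers build that already includes the merged Llama PR; `converted_llama_7B` is just the example checkpoint path used later in this thread:

```python
# Sketch only: the post-rename class names in transformers.
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer  # FSDP wrap target

model = LlamaForCausalLM.from_pretrained("converted_llama_7B")      # example checkpoint path
tokenizer = LlamaTokenizer.from_pretrained("converted_llama_7B")
print(LlamaDecoderLayer.__name__)  # the value to pass to --fsdp_transformer_layer_cls_to_wrap
```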
@joaopandolfi @devilismyfriend I will be progressively uploading the checkpoints; it's getting saved every 200 steps, and the total is 1200. Here's the 200-step checkpoint (i.e. 17%): https://huggingface.co/chavinlo/alpaca-native/tree/main
@chavinlo thank you for your work! Are you able to train the LoRA on 13B (or potentially larger)? Also, since the loss stops decreasing after ~1 epoch, it might not be necessary to keep training.
Yes, but only once I get my A100s fixed, because there definitely is something throttling them.
Thanks - I just fine-tuned my own LoRA with tloen's code; unfortunately, the results are not too different from 7B with LoRA. Maybe I am using the wrong prompts to test it lol
I encountered the same issue as you. I used the inference kwargs from this repo instead of what he has there and it's miles better; I still think it's a bit worse than this one, but give it a try.
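For anyone comparing outputs, here is a rough sketch of passing explicit generation kwargs via transformers; the model ID comes from the checkpoint linked above, but the sampling values and the prompt template are illustrative assumptions, not the exact settings from either repo:

```python
# Illustrative generation settings; tune temperature/top_p/max_new_tokens for your own comparison.
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained("chavinlo/alpaca-native")      # or a local checkpoint
tokenizer = LlamaTokenizer.from_pretrained("chavinlo/alpaca-native")

prompt = "### Instruction:\nName three primary colors.\n\n### Response:\n"  # Alpaca-style template
inputs = tokenizer(prompt, return_tensors="pt")

gen_config = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=128)
output = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```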
Thanks! Can't wait to test the final checkpoint :)
Could you specify what you meant here? Did you use the Alpaca code and train your own model?
@chavinlo
The reason for some of these issues is explained in this note. Feel free to reopen if it doesn't fully resolve the mysteries :) |
My issue is about speed... nothing related to these layer-class errors; they just came into my issue for some reason.
Search for the inference kwargs issue here and use the parameters from the generation config in the LoRA repo.
@lxuechen Can you reopen this issue? The original problem was about speed, not about layers. |
report.txt |
@chavinlo I am in the same boat: ~24 hour ETA with 4 x 80G A100. My GPU power usage is also low (~80 W / 300 W), which seems suspicious.
My training finishes in 4 hours with 8 x A100 (40G) using fp16 (3 epochs).
Environment:
Command:
torchrun --nproc_per_node=8 --master_port=1234 train.py --model_name_or_path converted_llama_7B --data_path ./alpaca_data.json --fp16 True --output_dir ./trained_model --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 5 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
However, once the training finishes, the model generates nonsense tokens, as mentioned in #70 and #51.
I think it might have to do with something in the OS configuration. I've changed GPUs and they still perform the same; the first time, it even went 2.5 times slower, giving an ETA of 52h. However, someone told me that RunPod works alright. Here are his graphs:
@charliezjw @puyuanliu
pip install git+https://github.com/zphang/transformers.git@llama_push
Each version is as follows:
@447428054 You can actually pull directly from the transformers git repo; they merged zphang's pull request 12 hours ago.
@chavinlo after some investigation, I realized the slowdown has something to do with FSDP and the A100 GPUs. I was able to get a <4 hour ETA on 4 x A100 80GB simply by setting
@helloeve Do you see any CUDA OOM errors while training? And is your inference output reasonable? |
God bless you... this works perfectly; the ETA is now 5:30 at 16 s/it. I believe this issue can be closed then? @lxuechen could this be added to the documentation, just in case?
I don't notice any OOM errors. The inference output is still to be seen after the training, but I guess it would be better to discuss it in the other issue(s) that have been created for this topic?
@helloeve Thanks for the reply! I will put it under other topics. |
@puyuanliu If you don't mind sharing your code to reproduce the problematic prediction output, I can give it a try once my training is done.
@helloeve Thanks again! |
I also got OOM with batch size 4; batch size 3 seems to work fine so far, with a slight ETA increase of ~30 min.
Regarding FSDP: the issue you encountered is likely GPU <-> GPU communication (the precise reason I don't fully understand). We will send in a PR to document this.
@chavinlo Your 7B Native is the best Alpaca Finetune available. Lots of people are excited to try your 13B Native finetune. Can you re-upload it to HF? |
I did one but I deleted it by accident lololol |
Oops! Looking forward to 30B as well. |
Additionally, the reason for the slowdown was a misconfiguration in the cross-GPU communication: it was using the PCIe host bridge (PHB) rather than NVLink. I verified this via:
While the solution above works, it probably won't work for larger models (?)
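For context, a common way to inspect the interconnect topology (not necessarily what was used above) is to print the `nvidia-smi` topology matrix; a minimal sketch, assuming `nvidia-smi` is on the PATH:

```python
# Prints the GPU interconnect matrix; NV# entries indicate NVLink between a GPU pair,
# while PHB means traffic crosses the PCIe host bridge (much slower for FSDP all-gathers).
import subprocess

print(subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout)
```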
Where do you rent your GPUs? Is it a spot instance? LambdaLabs never has the 8xA100s available for me :( Also, does that mean that PCIe GPU topologies will be very slow for 13B and should be avoided?
I think it works for 13B; not sure about other models.
Hello, first of all thank you for releasing the training code for Alpaca; we really appreciate it.
I am running the fine-tuning script on a 4xA100-SXM4-80GB node and currently getting a 24-hour ETA, which doesn't really scale with the reported "3 hours on 8 80GB A100s" mentioned at https://crfm.stanford.edu/2023/03/13/alpaca.html. Shouldn't it be around 6 hours, or even 12 hours, considering that the script "is not particularly optimized"?
Is anyone else encountering this issue? And if this is expected, what methods did you use to optimize the fine-tuning process?
Running on CUDA 12.1, Torch 1.13, and the transformers fork of LLaMA at the commit you mentioned.
Thanks.