Case Study: Instruction Tuning on AQLM Models #38
Hi @hiyouga !

```python
import aqlm

with aqlm.optimize_for_training():
    model = AutoModelForCausalLM.from_pretrained(...)
```

The thing is, there are a few ways to compute a forward pass: some work better for a very small number of tokens (e.g. generation), and some are optimized for large batch sizes (e.g. training). We're hoping to determine which kernels to use dynamically in later versions of aqlm, but, for now, please add that wrapper explicitly. Also, keep in mind that a model loaded under that wrapper will be very slow at generation. We're working on making it a more pleasant experience!
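For illustration, a minimal sketch of the two loading paths side by side. Only the `aqlm.optimize_for_training()` context manager is confirmed above; the checkpoint name and the `from_pretrained` keyword arguments are illustrative assumptions following standard `transformers` usage.

```python
import aqlm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"  # assumption: any AQLM-quantized checkpoint

# For fine-tuning: load under the wrapper so kernels optimized for
# large token counts (training batches) are selected.
with aqlm.optimize_for_training():
    train_model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )

# For generation: load without the wrapper to get the single-token
# (decoding) kernels; per the comment above, a model loaded under the
# wrapper is very slow at generation.
infer_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```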
@BlackSamorez Indeed! This is very important to me; I will try to fine-tune the model again following your advice. Thanks for pointing it out!
A bit more context: those are the speeds for a typical layer on an RTX 3090 GPU. We have a kernel for a single-token pass (generation), which is slightly faster than
As of now, we don't have a good enough kernel for anything between 4 and 4000 tokens processed in a pass. We're hoping to implement one someday.
I see. Generation is much faster than training; the gap might also be related to the gradient checkpointing technique used in training.
I've merged #39 and released
Sounds great! We will instruct users to use the latest AQLM in our training framework.
This issue is stale because it has been open for 30 days with no activity.
Hi, we have performed a small experiment on fine-tuning the Llama-2-70B-AQLM-2Bit model using the PEFT QLoRA method. We used the Alpaca and Glaive datasets for instruction tuning, and the fine-tuned version demonstrates preliminary conversation and tool-use abilities. We found that training requires only 24GB of GPU memory, while inference needs only 20GB, so fine-tuning a 70B model on consumer devices is feasible. However, AQLM significantly increases the overall training time; it would be great if the training speed could be improved. Thanks again for your excellent work!
Adapter weights: https://huggingface.co/hiyouga/Llama-2-70b-AQLM-2Bit-QLoRA-function-calling
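For reference, a minimal sketch of the kind of setup described above: attaching LoRA adapters to an AQLM-quantized base with PEFT and training with the Hugging Face Trainer. The checkpoint name, LoRA hyperparameters, toy dataset, and training arguments are illustrative assumptions, not the exact configuration used for the adapter linked above.

```python
import aqlm
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

BASE_MODEL = "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"  # assumption: an AQLM 2-bit checkpoint

# Load the quantized base under the training wrapper (see the discussion above).
with aqlm.optimize_for_training():
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
    )
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model.enable_input_require_grads()  # needed so gradient checkpointing works with frozen base weights

# Attach LoRA adapters; the quantized weights stay frozen and only the
# low-rank adapter matrices are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Toy instruction-tuning data standing in for Alpaca/Glaive-style examples.
texts = ["### Instruction:\nSay hello.\n\n### Response:\nHello!"]

def tokenize(example):
    out = tokenizer(example["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()
    return out

train_dataset = Dataset.from_dict({"text": texts}).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="aqlm-70b-qlora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,
        learning_rate=1e-4,
        num_train_epochs=1,
        fp16=True,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```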