Can't run 11 billion model on A100 with 80GB #14
Comments
Thanks for your interest in our work! It's hard to tell from the surface. Could you share with me the full log?
Hi @HaokunLiu
I remember the code will print out all the args in the beginning. Could you share that with me?
Sorry, I think the config might be slightly off, as it was meant for the 3B and not the 11B version. For the 11B variants, to fit into memory, we used a smaller batch size but still had an effective batch size of 8. Our hyperparameters were:
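For readers unfamiliar with the trick described above, here is a minimal sketch of the relationship between per-step batch size and gradient accumulation. The function and numbers are illustrative only, not the repo's actual config keys:

```python
# Illustrative sketch: a smaller per-step batch combined with gradient
# accumulation yields the same effective batch size (examples consumed
# per optimizer update), at a lower peak memory cost.

def effective_batch_size(per_step_batch: int, grad_accum_steps: int) -> int:
    """Examples processed per optimizer step."""
    return per_step_batch * grad_accum_steps

# Hypothetical 3B setting: batch of 8, no accumulation.
print(effective_batch_size(8, 1))  # 8

# Hypothetical 11B setting: batch of 1, accumulate gradients over 8 steps.
print(effective_batch_size(1, 8))  # 8
```

Gradient updates are mathematically equivalent in both settings (the summed gradient over 8 examples is the same); only the activation memory per forward/backward pass shrinks.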
Thanks!
Hi @craffel @muqeeth @HaokunLiu,
We're trying to reproduce T-Few results for a paper, but we're getting 'CUDA out of memory' using an A100 with 80GB (your recommended setup). This is what we're running:
python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42
We installed according to the README instructions and are using the default settings in the config files.
We are able to run the 3 billion model using the command above, just not the 11 billion one.
Is there anything we are doing wrong?
This is the exception:
Thank you