uyhcire/distributed-lora-benchmark

Benchmark to measure the training throughput of a 7B LoRA model and a 1.4B baseline on standard EC2 instances. Uses Terraform to manage instances.

Requirements

To run this benchmark, you'll need an AWS account with enough quota to spin up the number of g5.xlarge instances you want to test with.

You'll also need access to the meta-llama/Llama-2-7b-chat-hf model on Hugging Face: if you haven't already, create a Hugging Face account and request access on the model's page. Finally, generate an HF access token for your account at https://huggingface.co/settings/tokens.
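If you don't want to paste the token inline later, you can stash it in a shell variable first. The name below matches the placeholder used in the run command further down, but it's just a convention:

$ export your_hf_auth_token=hf_XXXXXXXXXXXX   # placeholder; substitute your real token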

How to run

Install dependencies
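
The README doesn't spell out the install command here. Assuming a standard Python project with a requirements.txt, and noting that Terraform needs a one-time init in the repo root before the first apply, something like this should cover it:

$ pip install -r requirements.txt   # assumes the repo ships a requirements.txt
$ terraform init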

Spin up EC2 instances

$ terraform apply -var "public_key=$(cat $path_to_your_id_rsa_dot_pub_or_equivalent)"
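
If the Terraform configuration exposes a variable for how many instances to launch, you can override it the same way. Note that instance_count below is a hypothetical name; check variables.tf for the real one:

$ terraform apply -var "public_key=$(cat $path_to_your_id_rsa_dot_pub_or_equivalent)" -var "instance_count=4"   # "instance_count" is a hypothetical variable name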

Run the benchmark

First, wait 80 seconds or so for the instances to be ready.

$ sleep 80
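
The fixed sleep is a rough heuristic. If you'd rather poll until an instance actually accepts SSH, a sketch like this works, taking the IP from the Terraform output:

$ until ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no ubuntu@$instance_ip true; do sleep 5; done   # "ubuntu" user assumes an Ubuntu AMI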

Then, run the benchmark script:

$ python run.py --ssh-privkey-path $path_to_your_id_rsa_or_equivalent --hf-auth-token $your_hf_auth_token
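
For example, with the default key location and the token variable exported earlier:

$ python run.py --ssh-privkey-path ~/.ssh/id_rsa --hf-auth-token $your_hf_auth_token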

Once the training runs have kicked off, you should see output like:

Node 0: Average step time ____ ms...
Node 0: Average step time ____ ms...
Node 0: Average step time ____ ms...
Node 0: Average step time ____ ms...
...

Tear down the instances

Make sure to tear down your Terraform stack once you're done, so you don't keep paying for GPU instances. (The public_key variable still has to be set for terraform destroy, but its value can be left empty.)

$ terraform destroy -var "public_key="
