Add warmup for Llama #5756
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5756
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure
As of commit 3668b26 with merge base 905b88c:
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 0c5afba to 02efbeb
Force-pushed from 02efbeb to 3668b26
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
bool warming = false);
::executorch::runtime::Error warmup(
    const std::string& prompt,
    int32_t seq_len = 128);
should we move seq_len out to a constant variable DEFAULT_SEQ_LEN that is shared by both generate and warmup?
So warmup is done with the same prompt/seq_len as a real run today. The default is just something I carried over from generate().
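For readers following along, here is a minimal, self-contained sketch of what the reviewer's suggestion might look like: one shared constant used as the default for both declarations. The `Error` alias stands in for `::executorch::runtime::Error`, and the class is trimmed to just the two methods shown in the diff; this is illustrative, not the actual ExecuTorch header.

```cpp
#include <cstdint>
#include <string>

// Illustrative stand-in for ::executorch::runtime::Error so this snippet
// compiles on its own.
using Error = int;

// The reviewer's suggestion: one shared default so generate() and warmup()
// cannot drift apart.
constexpr int32_t DEFAULT_SEQ_LEN = 128;

class Runner {
 public:
  Error generate(
      const std::string& prompt,
      int32_t seq_len = DEFAULT_SEQ_LEN,
      bool warming = false);

  Error warmup(
      const std::string& prompt,
      int32_t seq_len = DEFAULT_SEQ_LEN);
};
```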
ET_LOG(
    Info,
if (!warmup) {
  printf("\n");
safe_printf?
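For context on this nit, here is a minimal sketch of what a safe_printf-style guard typically does, modeled on llama2.c's helper of the same name; the exact name and behavior of the helper in this runner's utilities are assumptions here.

```cpp
#include <cctype>
#include <cstdio>

// Sketch only (modeled on llama2.c's safe_printf): skip null/empty pieces
// and lone bytes that are neither printable nor whitespace, so partial
// UTF-8 token pieces do not emit garbage to the terminal.
inline void safe_printf_sketch(const char* piece) {
  if (piece == nullptr || piece[0] == '\0') {
    return;
  }
  if (piece[1] == '\0') {
    unsigned char byte_val = static_cast<unsigned char>(piece[0]);
    if (!(std::isprint(byte_val) || std::isspace(byte_val))) {
      return;
    }
  }
  std::printf("%s", piece);
}
```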
if (warmup) {
  ET_LOG(Info, "Doing a warmup run...");
}
maybe add this before generate in Runner::warmup?
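A short sketch of that suggestion, written against the declarations shown earlier in this diff (so not self-contained; `ET_LOG`, `Error`, and the `Runner` names are taken from the PR, and the argument list passed to generate() is simplified).

```cpp
// Sketch of the reviewer's nit: emit the log from inside Runner::warmup,
// right before delegating to generate(), instead of branching on a warmup
// flag at the call site. Argument list simplified for illustration.
Error Runner::warmup(const std::string& prompt, int32_t seq_len) {
  ET_LOG(Info, "Doing a warmup run...");
  return generate(prompt, seq_len, /*warming=*/true);
}
```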
- RUNTIME_ARGS="--model_path=${EXPORTED_MODEL_NAME} --tokenizer_path=tokenizer.bin --prompt=Once --temperature=0 --seq_len=10"
+ RUNTIME_ARGS="--model_path=${EXPORTED_MODEL_NAME} --tokenizer_path=tokenizer.bin --prompt=Once --temperature=0 --seq_len=10 --warmup=1"
why warm up in CI?
Yeah, good question. This is so we don't break things with warmup. I debated this and was thinking about doing two runs and comparing outputs with and without warmup, but CI is expensive, so I just ran with warmup and compared the output after.
Take a look at my nits; otherwise it's fine.
@digantdesai merged this pull request in 660ef77.
}

Error Runner::warmup(const std::string& prompt, int32_t seq_len) {
  Error err = generate(
It should be prefill, right, not generate? Or does generate call prefill? And you are warming up not by prefill only, but by running the entire sequence generation?
Also, how do you enable this for llava?
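To make the prefill-versus-generate distinction concrete, here is a small, self-contained toy of the control flow only; none of this is the real ExecuTorch runner. A warmup that delegates to generate() exercises both the prompt prefill and the autoregressive decode loop, whereas a prefill-only warmup would stop after the first phase.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Toy model of the control flow, for illustration only.
class ToyRunner {
 public:
  void generate(const std::string& prompt, int32_t seq_len, bool warming) {
    // Phase 1: prefill - run the prompt tokens through the model
    // (in a real runner this fills the KV cache).
    for (char token : prompt) {
      (void)token;  // pretend to feed each prompt token to the model
    }
    // Phase 2: decode - generate tokens one at a time up to seq_len.
    for (int32_t i = 0; i < seq_len; ++i) {
      if (!warming) {
        std::printf(".");  // a real runner prints decoded token pieces here
      }
    }
    if (!warming) {
      std::printf("\n");
    }
  }

  void warmup(const std::string& prompt, int32_t seq_len) {
    std::printf("Doing a warmup run...\n");
    // Delegating to generate() warms up both phases, quietly.
    generate(prompt, seq_len, /*warming=*/true);
  }
};
```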
Load the model. Run everything twice. Reset stats between the two runs. Also decrease the logging level for the warmup run. (A rough sketch of this flow follows the notes below.)
Notes:
- Tested on Android and Mac, with Llama2 and Llama3; with temperature=0 it produces the same output.
- This warmup option is disabled by default.
- This is inspired by llama.cpp options [1, 2].
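A hedged sketch of the flow described above: load once, run twice, and reset the timing stats between the warmup run and the measured run. All names here (DemoRunner, DemoStats, reset) are placeholders for illustration, not the real runner API.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Placeholder stats object; the real runner tracks timing, not a counter.
struct DemoStats {
  int64_t tokens_generated = 0;
  void reset() { tokens_generated = 0; }
};

struct DemoRunner {
  DemoStats stats;
  void load() { /* load model + tokenizer once */ }
  void generate(const std::string& prompt, int32_t seq_len, bool warming) {
    (void)prompt;
    for (int32_t i = 0; i < seq_len; ++i) {
      stats.tokens_generated++;
      if (!warming) std::printf(".");
    }
    if (!warming) std::printf("\n");
  }
  void warmup(const std::string& prompt, int32_t seq_len) {
    generate(prompt, seq_len, /*warming=*/true);
  }
};

int main() {
  DemoRunner runner;
  runner.load();
  runner.warmup("Once", 10);   // first run: warms caches/kernels, prints nothing
  runner.stats.reset();        // reset stats so warmup does not skew the numbers
  runner.generate("Once", 10, /*warming=*/false);  // second, measured run
  std::printf("tokens generated: %lld\n",
              static_cast<long long>(runner.stats.tokens_generated));
  return 0;
}
```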
Sample runs