Add max-autotune for CPU, update profile and fix next token calculation #1055
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1055
✅ No failures as of commit 74f921c with merge base 8cb8a35. (This comment was automatically generated by Dr. CI.)
Thanks for the plumbing through of max_autotune @yanbing-j, that part looks great to me. @vmpuri Can you give the profiling/token calculation a quick pass though?
@Jack-Khuu @vmpuri Thanks for the review! Let me clarify the profiling update and the next-token calculation fix. In profiling, I added logic to print the profiling table for both CPU and GPU. In the next-token calculation, the first token is now counted in the total throughput, and next-token throughput is computed over the time after the first token.
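For reference, a minimal sketch of what printing the profiler table for both backends might look like; `print_profile_table`, `prof`, and `device` are illustrative names, not the actual torchchat code:

```python
import torch

# Hedged sketch: print the profiler summary table for whichever device was used.
# `prof` is assumed to be a torch.profiler.profile collected around generation,
# and `device` the device string from the builder args; names are illustrative.
def print_profile_table(prof: torch.profiler.profile, device: str) -> None:
    sort_key = "self_cuda_time_total" if device == "cuda" else "self_cpu_time_total"
    print(prof.key_averages().table(sort_by=sort_key, row_limit=20))
```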
@Jack-Khuu @vmpuri Could you please help review and merge this PR?
Thanks @yanbing-j for the changes and pinging again
Some minor changes, then we're gtg
generate.py
Outdated
with {'sequential' if generator_args.sequential_prefill else 'parallel'} prefill,\n\
generate {num_tokens_generated} tokens, in total {tokens_sec:.02f} tokens/sec, \n\
latency_per_token_seconds: {1 / tokens_sec:.04f} s/token\n\
first_token_latency_seconds: {aggregate_metrics.get('time_to_first_token', -1.0):.02f} s/token \n\
This is the same as the time to first token
first_token_latency_seconds: {aggregate_metrics.get('time_to_first_token', -1.0):.02f} s/token \n\
generate.py
Outdated
@@ -831,7 +847,8 @@ def callback(x, *, done_generating=False):
     )
     print("---------------------------------------------------")

-    tokens_sec = num_tokens_generated / t
+    tokens_sec = (num_tokens_generated + 1) / t
+    next_tokens_sec = num_tokens_generated / (t - aggregate_metrics.get('time_to_first_token', -1.0))
The `.get(..., -1)` fallback in the denominator logically influences `t`: if the key is missing, subtracting -1.0 inflates the elapsed time by a second. Use 0 as the default instead:
- next_tokens_sec = num_tokens_generated / (t - aggregate_metrics.get('time_to_first_token', -1.0))
+ next_tokens_sec = num_tokens_generated / (t - aggregate_metrics.get('time_to_first_token', 0))
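In other words, if 'time_to_first_token' were ever missing, the -1.0 fallback would silently add a second to the denominator. A toy illustration with made-up numbers:

```python
# Why the fallback value matters (numbers are made up for illustration):
t = 10.0                    # total generation time in seconds
num_tokens_generated = 100
metrics = {}                # 'time_to_first_token' happens to be missing

bad = num_tokens_generated / (t - metrics.get('time_to_first_token', -1.0))  # 100 / 11.0 ≈ 9.09
good = num_tokens_generated / (t - metrics.get('time_to_first_token', 0))    # 100 / 10.0 = 10.0
```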
generate.py
Outdated
generate {num_tokens_generated} tokens, in total {tokens_sec:.02f} tokens/sec, \n\
latency_per_token_seconds: {1 / tokens_sec:.04f} s/token\n\
first_token_latency_seconds: {aggregate_metrics.get('time_to_first_token', -1.0):.02f} s/token \n\
next_token_latency_seconds: {1 / next_tokens_sec:.04f} s/token \n\
Let's also log next_tokens_sec on the row above.
So ultimately it'll be (see the sketch below):
toks/sec (with first token)
sec/toks (with first token)
toks/sec (wo first token)
sec/toks (wo first token)
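A sketch of what the resulting metrics could look like; `t`, `num_tokens_generated` (excluding the first token), and `aggregate_metrics` follow the diff above, and the exact log wording is illustrative:

```python
# Hedged sketch of the four requested metrics; variable names follow the diff above.
first_token_sec = aggregate_metrics.get('time_to_first_token', 0)
tokens_sec = (num_tokens_generated + 1) / t                      # toks/sec (with first token)
next_tokens_sec = num_tokens_generated / (t - first_token_sec)   # toks/sec (w/o first token)

print(
    f"Total throughput: {tokens_sec:.02f} tokens/sec, {1 / tokens_sec:.04f} s/token\n"
    f"First token latency: {first_token_sec:.02f} sec\n"
    f"Next-token throughput: {next_tokens_sec:.02f} tokens/sec, {1 / next_tokens_sec:.04f} s/token"
)
```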
Force-pushed from 13d3535 to 0155e2e
@Jack-Khuu Thanks for the comments! Please review again!
Thanks for updating the logging, everything looks good
Give the merge conflict (should be minor) a look and it's set
Force-pushed from d816cb4 to 58caf09
@Jack-Khuu Thanks for the review! I have rebased on the main branch, and I also hide …
Force-pushed from 6d49401 to 74f921c
Hi @Jack-Khuu, please help merge this PR. Thanks!
Print the average numbers for total, first-token, and next-token throughput.
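A hedged sketch of how such averages might be accumulated and reported; the key names in `aggregate_metrics` are assumptions for illustration, not necessarily the ones used in generate.py:

```python
import torch

# Hypothetical aggregation: each generation appends its per-run numbers, and the
# averages are printed once at the end. Key names are illustrative.
aggregate_metrics = {"tokens_per_sec": [], "first_token_per_sec": [], "next_tokens_per_sec": []}

def report_averages(metrics: dict) -> None:
    for name, values in metrics.items():
        if values:
            avg = torch.mean(torch.tensor(values)).item()
            print(f"Average {name}: {avg:.2f}")
```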
Hi @Jack-Khuu, please help review and merge this PR. Thanks!
Thanks for following up. I'm debugging some weird behavior with the output messages at the moment (on main). Will merge this in once that's resolved.
@Jack-Khuu Thanks! All the CI passes. Please help me update the branch, because if I rebase, all the CI needs to run again.
         )

         self.decode_one_token = torch.compile(
-            self.decode_one_token, mode="reduce-overhead", fullgraph=True
+            self.decode_one_token, fullgraph=True, **kwargs
         )

         if generator_args.compile_prefill:
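For context, a hedged sketch of how the `kwargs` forwarded to `torch.compile` might be chosen; the `max_autotune` flag name is an assumption for illustration, not necessarily the option torchchat exposes:

```python
# Illustrative only: pick the Inductor mode from a max-autotune toggle and
# forward it through **kwargs, as in the diff above.
if generator_args.max_autotune:      # assumed flag name, for illustration
    kwargs = {"mode": "max-autotune"}
else:
    kwargs = {"mode": "reduce-overhead"}

self.decode_one_token = torch.compile(
    self.decode_one_token, fullgraph=True, **kwargs
)
```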
Pass kwargs to the compile_prefill model too?
Yes, I missed this feature. I will create another PR to support it: #1112.
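A hedged sketch of what that follow-up could look like, forwarding the same kwargs to the prefill compile path; attribute names mirror the diff above and may not match #1112 exactly:

```python
# Illustrative follow-up: reuse the same compile kwargs for the prefill path.
if generator_args.compile_prefill:
    self.prefill = torch.compile(
        self.prefill, fullgraph=True, dynamic=True, **kwargs
    )
```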
Thanks again for the changes @yanbing-j. Merging in (I'll tweak some nits in a separate PR).
@yanbing-j, with these changes, I observed behavior that differs from earlier runs. With this PR's commits merged onto torchchat's main branch, I see a lot of auto-tuning benchmarking results, even for the same shapes. Is this expected behavior? Thanks!
@yanbing-j, turns out …
@sanchitintel The logs you observed from autotuning are printed by setting …
Thanks, @yanbing-j! That's what I meant. Should we disable it, as it's too verbose? Even without …
@sanchitintel This config is removed in #1112.
This PR adds max-autotune for CPU in torch.compile. It also splits the first-token and next-token metrics in the log output.