
Conversation

yanbing-j
Contributor

This PR adds max-autotune support for CPU in torch.compile. It also splits first-token and next-token metrics in the log output.
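
A minimal sketch of how a max-autotune option can be threaded into torch.compile on CPU (the helper name compile_decode_fn and its max_autotune parameter are illustrative, not this PR's exact code):

import torch

def compile_decode_fn(decode_one_token, max_autotune: bool = False):
    # When max-autotune is requested, ask Inductor to benchmark candidate
    # kernels; this mode is supported on CPU as well as GPU.
    kwargs = {"mode": "max-autotune"} if max_autotune else {}
    return torch.compile(decode_one_token, fullgraph=True, **kwargs)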


pytorch-bot bot commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1055

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 74f921c with merge base 8cb8a35:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 23, 2024
@Jack-Khuu
Contributor

Thanks for plumbing max_autotune through, @yanbing-j. That part looks great to me.

@vmpuri Can you give the profiling/token calculation a quick pass though?

@yanbing-j
Contributor Author

@Jack-Khuu @vmpuri Thanks for the review! Let me clarify the profiling update and the next-token calculation fix.

For profiling, I added logic to print the profiling table for both CPU and GPU. For the next-token calculation, t covers both the first token (prefill) and the subsequent tokens (decode_n_tokens), while num_tokens_generated counts only the subsequent tokens, so t and num_tokens_generated do not match. I suspect this was an oversight introduced when first-token time was added. I also print first-token latency and next-token latency separately in the log.
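
A worked example of the corrected arithmetic (dummy numbers; the variable names follow the diffs below, not a verbatim excerpt from generate.py):

t = 2.5                      # total generation time: prefill + decode (s)
num_tokens_generated = 20    # decoded tokens only, excludes the first token
time_to_first_token = 0.5    # prefill latency (s)

# Overall rate must count the prefill token as well.
tokens_sec = (num_tokens_generated + 1) / t                         # 8.4 tok/s
# Decode-only rate excludes the prefill time from the denominator.
next_tokens_sec = num_tokens_generated / (t - time_to_first_token)  # 10.0 tok/s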

@yanbing-j
Contributor Author

@Jack-Khuu @vmpuri Could you please help review and merge this PR?

Contributor

@Jack-Khuu left a comment


Thanks @yanbing-j for the changes and pinging again

Some minor changes, then we're gtg

generate.py Outdated
with {'sequential' if generator_args.sequential_prefill else 'parallel'} prefill,\n\
generate {num_tokens_generated} tokens, in total {tokens_sec:.02f} tokens/sec, \n\
latency_per_token_seconds: {1 / tokens_sec:.04f} s/token\n\
first_token_latency_seconds: {aggregate_metrics.get('time_to_first_token', -1.0):.02f} s/token \n\
Contributor


This is the same as the time to first token

Suggested change (remove the duplicated line):
-    first_token_latency_seconds: {aggregate_metrics.get('time_to_first_token', -1.0):.02f} s/token \n\

generate.py Outdated
@@ -831,7 +847,8 @@ def callback(x, *, done_generating=False):
)
print("---------------------------------------------------")

-    tokens_sec = num_tokens_generated / t
+    tokens_sec = (num_tokens_generated + 1) / t
+    next_tokens_sec = num_tokens_generated / (t - aggregate_metrics.get('time_to_first_token', -1.0))
Contributor


The .get(..., -1) in the denominator logically influences t.

Suggested change
-    next_tokens_sec = num_tokens_generated / (t - aggregate_metrics.get('time_to_first_token', -1.0))
+    next_tokens_sec = num_tokens_generated / (t - aggregate_metrics.get('time_to_first_token', 0))

generate.py Outdated
generate {num_tokens_generated} tokens, in total {tokens_sec:.02f} tokens/sec, \n\
latency_per_token_seconds: {1 / tokens_sec:.04f} s/token\n\
first_token_latency_seconds: {aggregate_metrics.get('time_to_first_token', -1.0):.02f} s/token \n\
next_token_latency_seconds: {1 / next_tokens_sec:.04f} s/token \n\
Contributor


Let's also log next_tokens_sec on the row above.

So ultimately it'll be:

toks/sec (with first token)
sec/toks (with first token)
toks/sec (w/o first token)
sec/toks (w/o first token)
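
Roughly, a sketch of that final block (the function name print_throughput and the variable names are illustrative; the exact strings in generate.py may differ):

def print_throughput(tokens_sec: float, next_tokens_sec: float) -> None:
    # Four-line summary: throughput and latency with and without the first token.
    print(
        f"Tokens/sec (incl. first token): {tokens_sec:.02f}\n"
        f"Sec/token  (incl. first token): {1 / tokens_sec:.04f}\n"
        f"Tokens/sec (excl. first token): {next_tokens_sec:.02f}\n"
        f"Sec/token  (excl. first token): {1 / next_tokens_sec:.04f}"
    )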

@yanbing-j
Contributor Author

@Jack-Khuu Thanks for the comments! Please review again!

Contributor

@Jack-Khuu left a comment


Thanks for updating the logging; everything looks good.

Give the merge conflict (should be minor) a look and it's set

@yanbing-j
Contributor Author

yanbing-j commented Sep 3, 2024

@Jack-Khuu Thanks for the review!

I have rebased onto the main branch. I also exclude the tokens_sec samples that include JIT compilation time, so the average throughput is more accurate. The log now also prints the averages of total throughput, first-token throughput, and next-token throughput.
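
A minimal sketch of the idea (assuming the first sample's tokens_sec includes torch.compile time; the numbers and list name are made up, not this PR's exact code):

# Drop the sample whose timing includes JIT compilation, then average the rest.
per_sample_tokens_sec = [3.1, 12.8, 13.0, 12.9, 13.1]  # sample 0 includes compile time

steady_state = per_sample_tokens_sec[1:] or per_sample_tokens_sec
avg_tokens_sec = sum(steady_state) / len(steady_state)
print(f"Average tokens/sec (excluding compile warm-up): {avg_tokens_sec:.2f}")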

@yanbing-j force-pushed the yanbing/update branch 2 times, most recently from 6d49401 to 74f921c on September 4, 2024 at 06:39
@yanbing-j
Contributor Author

Hi @Jack-Khuu , please help merge this PR. Thanks!

@yanbing-j
Contributor Author

Hi @Jack-Khuu , please help review and merge this PR. Thanks!

@Jack-Khuu
Contributor

Thanks for following up. I'm debugging some weird behavior with the output messages at the moment (on main)

Will merge this in once that's resolved

@yanbing-j
Contributor Author

yanbing-j commented Sep 5, 2024

@Jack-Khuu Thanks! All the CI checks pass. Could you update the branch on your side? If I rebase, all the CI jobs would need to run again.

)

self.decode_one_token = torch.compile(
-    self.decode_one_token, mode="reduce-overhead", fullgraph=True
+    self.decode_one_token, fullgraph=True, **kwargs
)

if generator_args.compile_prefill:


Pass kwargs to compile_prefill model too?

Contributor Author


Yes, I missed this. I created another PR to support it: #1112.
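
A rough sketch of that follow-up, assuming the same compile kwargs are simply reused for the prefill path (names mirror the snippet above; not the exact code in #1112):

# Reuse the compile kwargs (e.g. mode="max-autotune") for prefill as well.
if generator_args.compile_prefill:
    self.prefill = torch.compile(self.prefill, fullgraph=True, **kwargs)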

@Jack-Khuu
Contributor

Thanks again for the changes @yanbing-j

Merging in (I'll tweak some nits in a separate PR)

@Jack-Khuu merged commit d58923e into pytorch:main on Sep 5, 2024
51 checks passed
@sanchitintel
Contributor

sanchitintel commented Sep 5, 2024

@yanbing-j, with these changes I observed different behavior than before while running generate.py.
I'm not sure whether it's because of this PR or because of other changes introduced in torchchat.

With this PR's commits merged onto torchchat's main branch, I see a lot of auto-tuning benchmarking results, even for the same shapes, after I run python3 torchchat.py generate llama3.1 --prompt 'Hello my name is' --quantize '{"linear:int8": {"bitwidth": 8, "groupsize": 0}}' --compile --num-samples 5 --device cpu --tokenizer-path /localdisk/sanchitj/llama_3.1/original/tokenizer.model --max-autotune

Is this expected behavior? Thanks!

@sanchitintel
Contributor

sanchitintel commented Sep 5, 2024

@yanbing-j, it turns out torch._inductor.config.trace.log_autotuning_results = True simply displays more auto-tuning results. That's fine, since auto-tuning is not re-run for duplicate input shapes; enabling this logging just prints duplicate data.

@yanbing-j
Contributor Author

@sanchitintel The autotuning logs you observed are printed because torch._inductor.config.trace.log_autotuning_results = True is set.
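
For anyone reproducing this, the flag can be set back to its default (a minimal sketch):

import torch._inductor.config

# Turning this off suppresses the extra per-choice autotuning output; the
# per-shape benchmarking logs from max-autotune itself still appear.
torch._inductor.config.trace.log_autotuning_results = False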

@sanchitintel
Contributor

sanchitintel commented Sep 6, 2024

Thanks, @yanbing-j! That's what I meant.

Should we disable it, as it's too verbose? Even without torch._inductor.config.trace.log_autotuning_results = True, we get benchmarking logs for all unique input shapes. Thanks!

@yanbing-j
Contributor Author

@sanchitintel This config is removed in #1112.
