Updates after review #3

Conversation


@vvchernov vvchernov commented Jan 5, 2024

Updates after Masa's review in #82 and some fixes:

  • Fix naming for the sake of clarity
  • Hide the logprobs calculation behind a condition so that it does not reduce performance
  • Add an intermediate dataclass for raw logprob info (a rough sketch follows this list)
  • Use the OpenAI dataclasses for logprobs in general
  • Fix the logprob calculation from probabilities when sampling with randomization
  • Other small fixes
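
For readers skimming the list, a rough, hypothetical sketch of what such an intermediate container for raw logprob data could look like before it is converted into the OpenAI-style dataclasses; the class and field names here are illustrative assumptions, not the actual PR code:

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RawLogprobInfo:
    # Hypothetical intermediate record produced during sampling: the sampled
    # token with its logprob, plus an optional mapping of the top alternative
    # token ids to their logprobs.
    current_token_id: int
    current_logprob: float
    top_logprobs: Optional[Dict[int, float]] = None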

Note: I have some doubts:

  1. A possible typo in paged_cache_model.py (see the TODO)?

     if do_top_p or do_top_k:
         logits = _apply_top_p_top_k(logits_random, top_ps, top_ks)

     It seems that logits_random should be used here instead of logits.

  2. For n > 1, only one token is generated after the prefill step; why not n tokens?

cc @zxybazh, @masahi

@vvchernov vvchernov marked this pull request as ready for review January 8, 2024 10:51

masahi commented Jan 8, 2024

It seems that logits_random should be used here instead of logits.

No, here we are separately doing greedy and random sampling. Only the latter is relevant for top p/k.
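
To illustrate this split, a minimal self-contained sketch (the helper below is a simplified stand-in for the real _apply_top_p_top_k, and the batch layout, is_random mask, and scalar top_p/top_k values are illustrative assumptions, not the mlc-serve implementation):

import torch

def apply_top_p_top_k(logits, top_p, top_k):
    # Simplified top-k + top-p (nucleus) filter over [batch, vocab] logits:
    # tokens outside the top-k or outside the top-p cumulative mass get -inf.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    ranks = torch.arange(logits.shape[-1], device=logits.device)
    keep = (ranks < top_k) & (cum_probs - probs < top_p)
    keep[..., 0] = True  # always keep the most likely token
    sorted_logits = sorted_logits.masked_fill(~keep, float("-inf"))
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

logits = torch.randn(4, 32000)                       # one row of logits per request
is_random = torch.tensor([True, False, True, True])  # which requests use random sampling

# Greedy requests: plain argmax over the unfiltered logits; top-p/top-k is irrelevant.
tokens_greedy = torch.argmax(logits[~is_random], dim=-1)

# Random-sampling requests: only these are filtered before multinomial sampling.
filtered = apply_top_p_top_k(logits[is_random], top_p=0.9, top_k=40)
tokens_random = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1).squeeze(-1)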

For n > 1, only one token is generated after the prefill step; why not n tokens?

Hmm interesting, I never thought about sampling a token individually for different sequences after prefill, i.e. the first tokens in each of n generations are the same. I think it is fine and simpler, but curious what others think.

@vvchernov (Collaborator, Author) commented

Hello @masahi! Thank you for the quick response!

No, here we are separately doing greedy and random sampling. Only the latter is relevant for top p/k.

The following is still not clear to me: the logits produced by logits = _apply_top_p_top_k(logits_random, top_ps, top_ks) are not used in the sample(...) function below that expression, and they are also not used at the higher level (in the generate(...) function). It looks like the intent is to transform logits_random in _apply_top_p_top_k and use it afterwards.
Maybe write it as logits_random = _apply_top_p_top_k(logits_random, top_ps, top_ks), so as not to confuse anybody?

Hmm interesting, I never thought about sampling a token individually for different sequences after prefill, i.e. the first tokens in each of n generations are the same. I think it is fine and simpler, but curious what others think.

As far as I can see, you are right: "the first tokens in each of n generations are the same". It seems strange to me that we only start randomizing tokens from the second one. And yes, it is harder to implement and may not be a priority. Who else could discuss and think about this?


masahi commented Jan 9, 2024

Ah yes, you are right.

logits = _apply_top_p_top_k(logits_random, top_ps, top_ks)

should be

logits_random = _apply_top_p_top_k(logits_random, top_ps, top_ks)

I will fix it ASAP
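
Applying that fix to the snippet quoted earlier, the block becomes:

if do_top_p or do_top_k:
    # Assign back to logits_random so the filtered logits are the ones
    # actually sampled from afterwards.
    logits_random = _apply_top_p_top_k(logits_random, top_ps, top_ks)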


masahi commented Jan 9, 2024

Who else could discuss and think about this?

@sunggg What do you think? Currently in parallel sampling, the first tokens in each generation are the same, since we just generate one token after prefill, which is then copied into each generation:
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L491-L499


elvin-n commented Jan 9, 2024

Who else could discuss and think about this?

@sunggg What do you think? Currently in parallel sampling, the first tokens in each generation are the same, since we just generate one token after prefill, which is then copied into each generation: https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L491-L499

Formally, there should be n different independent samples. We could do this in a fairly lightweight way by handling exactly n samples for prefill; the main logic might not need to change. But since it affects only one token, the priority is not high.
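
As a small illustration of the two behaviors being discussed (a hedged sketch; the tensor shapes and variable names are assumptions, not the actual paged_cache_model.py code):

import torch

# Next-token distribution for one request after prefill (illustrative values).
probs = torch.softmax(torch.randn(32000), dim=-1)
n = 4

# Current behavior: sample a single token and copy it into each of the n
# generations, so all n sequences start with the same first token.
first_token = torch.multinomial(probs, num_samples=1)
current_first_tokens = first_token.repeat(n)

# Suggested behavior: draw n independent samples after prefill, so each
# generation can start with a different first token.
proposed_first_tokens = torch.multinomial(probs, num_samples=n, replacement=True)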


@zxybazh zxybazh left a comment


Thanks for the timely PR to address the comments. LGTM except the condition for logprob sampling.

Review comment on serve/mlc_serve/model/paged_cache_model.py (outdated, resolved)

zxybazh commented Jan 9, 2024

Thanks @vvchernov, I'll merge this PR as it's quite mature; let's continue the discussion thread in the original PR octoml#82.

@zxybazh zxybazh merged commit 4c56eac into zxybazh:feature/2023-11-22/enable-mlc-server-logprobs Jan 9, 2024
1 check passed
@vvchernov vvchernov deleted the vc/update branch January 9, 2024 11:16

sunggg commented Jan 16, 2024

@sunggg What do you think? Currently in parallel sampling, the first tokens in each generation are the same, since we just generate one token after prefill, which is then copied into each generation:
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L491-L499

Interesting. I think both make sense. What is OpenAI's or vLLM's behavior? Since both approaches sound reasonable, can we match their behavior?


masahi commented Jan 16, 2024

I don't know about OpenAI, but vLLM samples n tokens after prefill. I created octoml#161 to fix our behavior.
