Updates after review #3

Conversation


@vvchernov vvchernov commented Jan 5, 2024

Updates after Masa's review in #82 and some fixes:

  • Fix naming for the sake of clarity
  • Hide the logprobs calculation behind a condition so that it does not reduce performance
  • Add an intermediate dataclass for raw logprob info (a rough sketch follows this list)
  • Use the OpenAI dataclasses for logprobs in general
  • Fix the logprob calculation from probabilities when sampling with randomization
  • Other small fixes
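
For readers skimming the list, a rough, hypothetical sketch of what such an intermediate container for raw logprob data could look like before it is converted into the OpenAI-style dataclasses; the class and field names here are illustrative assumptions, not the actual PR code:

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RawLogprobInfo:
    # Hypothetical intermediate record produced during sampling: the sampled
    # token with its logprob, plus an optional mapping of the top alternative
    # token ids to their logprobs.
    current_token_id: int
    current_logprob: float
    top_logprobs: Optional[Dict[int, float]] = None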

Note: I have some doubts:

  1. A possible typo in paged_cache_model.py (see the TODO)?

     if do_top_p or do_top_k:
         logits = _apply_top_p_top_k(logits_random, top_ps, top_ks)

     It seems that logits_random should be used here instead of logits.

  2. For n > 1, only one token is generated after the prefill step; why not n tokens?

cc @zxybazh, @masahi

@vvchernov vvchernov marked this pull request as ready for review January 8, 2024 10:51

masahi commented Jan 8, 2024

It seems that logits_random should be used here instead of logits.

No, here we are separately doing greedy and random sampling. Only the latter is relevant for top p/k.
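
To illustrate this split, a minimal self-contained sketch (the helper below is a simplified stand-in for the real _apply_top_p_top_k, and the batch layout, is_random mask, and scalar top_p/top_k values are illustrative assumptions, not the mlc-serve implementation):

import torch

def apply_top_p_top_k(logits, top_p, top_k):
    # Simplified top-k + top-p (nucleus) filter over [batch, vocab] logits:
    # tokens outside the top-k or outside the top-p cumulative mass get -inf.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = torch.cumsum(probs, dim=-1)
    ranks = torch.arange(logits.shape[-1], device=logits.device)
    keep = (ranks < top_k) & (cum_probs - probs < top_p)
    keep[..., 0] = True  # always keep the most likely token
    sorted_logits = sorted_logits.masked_fill(~keep, float("-inf"))
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

logits = torch.randn(4, 32000)                       # one row of logits per request
is_random = torch.tensor([True, False, True, True])  # which requests use random sampling

# Greedy requests: plain argmax over the unfiltered logits; top-p/top-k is irrelevant.
tokens_greedy = torch.argmax(logits[~is_random], dim=-1)

# Random-sampling requests: only these are filtered before multinomial sampling.
filtered = apply_top_p_top_k(logits[is_random], top_p=0.9, top_k=40)
tokens_random = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1).squeeze(-1)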

For n > 1, only one token is generated after the prefill step; why not n tokens?

Hmm interesting, I never thought about sampling a token individually for different sequences after prefill, i.e. the first tokens in each of n generations are the same. I think it is fine and simpler, but curious what others think.

@vvchernov (Collaborator, Author) commented

Hello @masahi! Thank you for the quick response!

No, here we are separately doing greedy and random sampling. Only the latter is relevant for top p/k.

The following is still not clear to me: the logits produced by logits = _apply_top_p_top_k(logits_random, top_ps, top_ks) are not used in the sample(...) function below that expression, and they are also not used at the higher level (in the generate(...) function). It looks like the intent is to transform logits_random in _apply_top_p_top_k and use it afterwards.
Maybe write it as logits_random = _apply_top_p_top_k(logits_random, top_ps, top_ks), so as not to confuse anybody?

Hmm interesting, I never thought about sampling a token individually for different sequences after prefill, i.e. the first tokens in each of n generations are the same. I think it is fine and simpler, but curious what others think.

As far as I can see, you are right: "the first tokens in each of n generations are the same". It seems strange to me that we only start randomizing tokens from the second one. And yes, it is harder to implement and may not be a priority. Who else could discuss and think about this?


masahi commented Jan 9, 2024

Ah yes, you are right.

logits = _apply_top_p_top_k(logits_random, top_ps, top_ks)

should be

logits_random = _apply_top_p_top_k(logits_random, top_ps, top_ks)

I will fix it ASAP
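
Applying that fix to the snippet quoted earlier, the block becomes:

if do_top_p or do_top_k:
    # Assign back to logits_random so the filtered logits are the ones
    # actually sampled from afterwards.
    logits_random = _apply_top_p_top_k(logits_random, top_ps, top_ks)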


masahi commented Jan 9, 2024

Who else could discuss and think about this?

@sunggg What do you think? Currently in parallel sampling, the first tokens in each generation are the same, since we just generate one token after prefill, which is then copied into each generation:
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L491-L499


elvin-n commented Jan 9, 2024

Who else could discuss and think about this?

@sunggg What do you think? Currently in parallel sampling, the first tokens in each generation are the same, since we just generate one token after prefill, which is then copied into each generation: https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L491-L499

Formally, there should be n different independent samples. We could do this in a fairly lightweight way by handling exactly n samples for prefill; the main logic might not need to change. But since it affects only one token, the priority is not high.
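
As a small illustration of the two behaviors being discussed (a hedged sketch; the tensor shapes and variable names are assumptions, not the actual paged_cache_model.py code):

import torch

# Next-token distribution for one request after prefill (illustrative values).
probs = torch.softmax(torch.randn(32000), dim=-1)
n = 4

# Current behavior: sample a single token and copy it into each of the n
# generations, so all n sequences start with the same first token.
first_token = torch.multinomial(probs, num_samples=1)
current_first_tokens = first_token.repeat(n)

# Suggested behavior: draw n independent samples after prefill, so each
# generation can start with a different first token.
proposed_first_tokens = torch.multinomial(probs, num_samples=n, replacement=True)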


@zxybazh zxybazh left a comment


Thanks for the timely PR to address the comments. LGTM except the condition for logprob sampling.

Review comment on serve/mlc_serve/model/paged_cache_model.py (outdated, resolved)

zxybazh commented Jan 9, 2024

Thanks @vvchernov, I'll merge this PR as it's quite mature; let's continue the discussion thread in the original PR octoml#82.

@zxybazh zxybazh merged commit 4c56eac into zxybazh:feature/2023-11-22/enable-mlc-server-logprobs Jan 9, 2024
1 check passed
@vvchernov vvchernov deleted the vc/update branch January 9, 2024 11:16

sunggg commented Jan 16, 2024

@sunggg What do you think? Currently in parallel sampling, the first tokens in each generation are the same, since we just generate one token after prefill, which is then copied into each generation:
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L491-L499

Interesting. I think both make sense. What is OpenAI's or vLLM's behavior? Since both approaches sound reasonable, can we match their behavior?


masahi commented Jan 16, 2024

I don't know about OpenAI, but vLLM samples n tokens after prefill. I created octoml#161 to fix our behavior.
