Misaligned beam search result with huggingface #975

Open
Abraham-Xu opened this issue Sep 7, 2023 · 7 comments
Labels
bug, stale

Comments

@Abraham-Xu
Contributor

I am so happy that the beam search PR (#857) has been merged into the main branch. However, I tested two models (llama2-7b and baichuan13b-base) and found that they generate beam search results that differ from huggingface transformers.

huggingface:
beam_outputs = model.generate(inputs.input_ids, do_sample=False, num_beams=4, num_return_sequences=4, max_new_tokens=1024)

vllm:
sampling_params = SamplingParams(temperature=0.0, use_beam_search=True, n=4, max_tokens=1024)

Is there anything wrong with my configuration? I checked vllm/tests/conftest.py and found no difference in sampling params.
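For reference, a minimal side-by-side sketch of the two configurations quoted above (assumptions: MODEL_PATH is a placeholder for a local llama2-7b checkpoint, there is enough GPU memory for both runs, the vLLM version still accepts use_beam_search in SamplingParams, and the prompt is only illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_PATH = "meta-llama/Llama-2-7b-hf"  # placeholder, substitute your checkpoint
PROMPT = "The quick brown fox"           # illustrative prompt
MAX_TOKENS = 1024

# Hugging Face beam search (deterministic beams: do_sample=False).
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).to("cuda")
inputs = tokenizer(PROMPT, return_tensors="pt").to("cuda")
hf_outputs = hf_model.generate(
    inputs.input_ids,
    do_sample=False,
    num_beams=4,
    num_return_sequences=4,
    max_new_tokens=MAX_TOKENS,
)
hf_texts = tokenizer.batch_decode(hf_outputs, skip_special_tokens=True)

# vLLM beam search with the settings from this issue.
llm = LLM(model=MODEL_PATH)
sampling_params = SamplingParams(
    temperature=0.0, use_beam_search=True, n=4, max_tokens=MAX_TOKENS
)
vllm_texts = [o.text for o in llm.generate([PROMPT], sampling_params)[0].outputs]

# Print both beam lists side by side; HF texts include the prompt, vLLM's do not.
for i, (hf_text, vllm_text) in enumerate(zip(hf_texts, vllm_texts)):
    print(f"--- beam {i} ---")
    print("hf  :", hf_text)
    print("vllm:", PROMPT + vllm_text)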

@WoosukKwon
Collaborator

Hi @Abraham-Xu, thanks for bringing this up. It may be because the generated outputs are too long. Because beam search uses the cumulative log probabilities of the beam candidates, the accumulated numerical differences due to different kernel implementations can affect the results when the outputs are long. Could you please try again with a smaller max_tokens (say 128)?

cc @zhuohan123
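To see why the output length matters, here is a toy (non-vLLM) calculation. The per-token log-probability discrepancy between kernel implementations (1e-5) and the gap between the top two beams' cumulative scores (5e-3) are both illustrative assumptions:

per_token_eps = 1e-5   # assumed per-token logprob difference between backends
score_gap = 5e-3       # assumed gap between two competing beams' cumulative scores

for seq_len in (128, 1024):
    drift = seq_len * per_token_eps  # worst-case accumulated difference
    print(f"len={seq_len}: worst-case drift {drift:.5f}, can reorder beams: {drift > score_gap}")

Under these assumptions the drift stays below the score gap at 128 tokens but exceeds it at 1024, which is why retrying with a smaller max_tokens is a useful sanity check.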

@Abraham-Xu
Contributor Author

Abraham-Xu commented Sep 12, 2023

Hi @WoosukKwon, thanks for the reply. Indeed, the result is better with a smaller max_tokens. But I doubt that the root cause of the error is the different kernel implementations, because FasterTransformer also uses kernel implementations different from huggingface's, yet it can produce the same result under FP32 precision with beam_width=4 and max_tokens=1024, which vLLM cannot. I recommend double-checking the implementation details of beam search.
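For anyone who wants to repeat that FP32 comparison, both frameworks can be pinned to float32 (sketch only; MODEL_PATH is a placeholder for your checkpoint):

import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

MODEL_PATH = "meta-llama/Llama-2-7b-hf"  # placeholder

hf_model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float32)
vllm_model = LLM(model=MODEL_PATH, dtype="float32")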

@lichen914

lichen914 commented Sep 12, 2023

I also have the same problem when using codellama with max_tokens=256 and best_of=4.

@lijianxing123

I also have the same problem when using vicuna with max_tokens=256 and best_of=5.

@lijianxing123

Has anyone solved this problem? In the vLLM llama model, I replaced some functions with their hf transformers counterparts, such as LlamaMLP and qkv_proj (everything except PagedAttentionWithRoPE), and the result is still different.

@KatIsCoding

SQLCoder is also affected by this. defog recommends running the model with beam search >= 4, but it is impossible to run it with that configuration on vLLM because the output is always incomplete.

@DarkLight1337 added the bug label on May 31, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions bot added the stale label on Oct 31, 2024