The output of vLLM is different from that of HF #2196
Comments
I've also noticed this issue. When I tested with Llama2 7B, specifying the same max tokens, the HF baseline produced a total of 52,455 tokens, whereas my vLLM run produced only 44,670 tokens.
How can I set min_new_tokens to ensure generations of the same length?
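A minimal sketch of one way to pin the generated length on both sides, assuming a Llama-2 checkpoint as a placeholder model: HF's `generate` accepts `min_new_tokens`, while in vLLM the `min_tokens` field of `SamplingParams` only exists in newer releases, so `ignore_eos=True` is the usual workaround on older builds.

```python
# Hedged sketch: forcing comparable output lengths in HF and vLLM.
# `min_tokens` exists only in newer vLLM releases; on older builds,
# `ignore_eos=True` is the common workaround. The model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

# HF: pin the generated length with min_new_tokens / max_new_tokens.
tok = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
inputs = tok("hello", return_tensors="pt").to(hf_model.device)
hf_out = hf_model.generate(
    **inputs, do_sample=False, min_new_tokens=128, max_new_tokens=128
)

# vLLM: suppress early EOS so the run also produces max_tokens tokens.
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)
vllm_out = llm.generate(["hello"], params)

print("HF new tokens:", hf_out.shape[1] - inputs["input_ids"].shape[1])
print("vLLM new tokens:", len(vllm_out[0].outputs[0].token_ids))
```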
Hi, I also got this problem, but my case is even worse. I used Llama2-7B-Chat-HF.
Config info:
Code:
Answer from llm("hello"):
My questions: does anyone else get weird answers and high GPU memory usage like I do? If yes, can you let me know how you fixed it?
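On the GPU memory point, vLLM pre-allocates most of the GPU for its KV cache by default, so high reported utilization is usually expected rather than a leak. A minimal sketch, assuming an illustrative model name, of lowering that fraction via the real `gpu_memory_utilization` argument:

```python
# Hedged sketch: vLLM reserves most of GPU memory up front for the KV cache,
# so near-full utilization is normal. Lowering gpu_memory_utilization shrinks
# the reservation at the cost of cache capacity. Model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    gpu_memory_utilization=0.5,             # reserve ~50% of the GPU instead of the ~90% default
)
out = llm.generate(["hello"], SamplingParams(temperature=0.0, max_tokens=64))
print(out[0].outputs[0].text)
```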
Hi, I noticed you can print the hidden_state of the vLLM model's output from the first forward pass. Is there any relevant code? Thank you very much.
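For the HF side this is straightforward; on the vLLM side there is no public API for intermediate activations, so people typically add a temporary print or `torch.save` inside the model's `forward()` under `vllm/model_executor/models/`. A minimal sketch of the HF half, assuming an illustrative model name:

```python
# Hedged sketch: HF exposes per-layer hidden states directly via
# output_hidden_states=True. vLLM has no equivalent public hook, so its side
# usually requires a local patch of the model file. Model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm-7b"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tok("hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[-1] is the last layer.
print(out.hidden_states[-1].shape)
print(out.hidden_states[-1][0, -1, :8])
```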
Hi @will-wiki, could you please answer @Abigail61?
Any updates on this?
I'm facing a similar issue with another model. Kindly provide an update on this.
Thank you very much for your outstanding work! When trying to use the internlm model, I found that the features produced by vLLM's first forward pass differ from those produced by HF for the same input. I want to ask why this is: is it caused by differences in the underlying implementations?
Here are some of the configurations for the experiment:
Environment
The code used to test the results
The input is guaranteed to be identical, but the hidden_state from the first forward pass differs between the two.
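Since vLLM does not expose intermediate hidden states through its public API, a common sanity check is to compare the greedy continuation and the top logprobs of the first generated token instead. A minimal sketch under that assumption, with an illustrative model name; small numerical differences are expected from fused kernels and fp16 accumulation order, so identical top tokens with slightly different scores are usually benign.

```python
# Hedged sketch: compare the next-token distribution after the prompt between
# HF and vLLM, rather than raw hidden states. Model name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "internlm/internlm-7b"  # placeholder model
prompt = "hello"

# HF reference: logits of the token following the prompt.
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
ids = tok(prompt, return_tensors="pt").to(hf_model.device)
with torch.no_grad():
    hf_logits = hf_model(**ids).logits[0, -1]
print("HF top-5 token ids:", torch.topk(hf_logits, 5).indices.tolist())

# vLLM: top logprobs of the first token under greedy decoding.
llm = LLM(model=model_id, trust_remote_code=True, dtype="float16")
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=1, logprobs=5))
print("vLLM top-5 token ids:", list(out[0].outputs[0].logprobs[0].keys()))
```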