Inference with LLaMA 65B generates nothing but \n #450
Comments
I had the same problem. The outputs from vLLM and HF are inconsistent. |
Does this happen only on the 65B model? I am using 7B normally. |
Is there any difference between the generation args of vLLM and HF? It seems that vLLM has some args that HF does not have. |
The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well.
Generation args are in the code above; both should use greedy decoding, so they shouldn't differ much (vLLM outputs nothing but '\n'). |
Yes, I found this problem too. In the case of greedy decoding, although LLaMA 7B, 13B, and 30B produce meaningful output, the results still differ from HF transformers. For example, I evaluated on several benchmarks (GSM8K, MMLU) and the scores differ between the two. |
The generation params can heavily affect final model performance. |
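(The defaults alone can explain part of this: vLLM's SamplingParams samples at temperature 1.0 by default, while HF's generate() defaults to greedy search, so the params have to be pinned explicitly on both sides. A minimal sketch, not taken from this thread:)

```python
from vllm import SamplingParams

# vLLM samples by default (temperature=1.0); request greedy explicitly.
greedy = SamplingParams(temperature=0.0, max_tokens=256)

# HF transformers' generate() is greedy by default, but a model's bundled
# generation_config.json can override that, so be explicit there too:
#   model.generate(input_ids, do_sample=False, max_new_tokens=256)
```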
So is it reliable to evaluate LLaMA results using your scripts? |
That is really weird... The same result can be stably reproduced on my V100 server. |
vLLM just failed to load the weights; for example, vLLM has no safetensors support yet. |
vLLM does not yet support safetensors, but that does not prevent us from converting the LLaMA model into the sharded PyTorch format (e.g. pytorch_model-00001-of-00003.bin) and then loading it with vLLM. |
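(A minimal sketch of that conversion, assuming a transformers version that supports the safe_serialization flag; paths are placeholders:)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the safetensors checkpoint with HF transformers, then re-save it
# as sharded pytorch_model-*.bin files that vLLM can load.
model = AutoModelForCausalLM.from_pretrained("path/to/llama-65b")
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-65b")

model.save_pretrained("path/to/llama-65b-bin", safe_serialization=False)
tokenizer.save_pretrained("path/to/llama-65b-bin")
```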
I have also encountered the same problem: the same prompt does not produce the same output even with greedy sampling params. Is anyone working on resolving this? |
I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (vLLM tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7). I compared through the sampling process and could not find a difference; if greedy doesn't match, could it be something in PagedAttention or the CUDA kernels? |
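(One way to localize the divergence is to compare greedy token IDs from both backends directly; the first position where they differ points at where the logits disagree. A rough sketch; the model path and prompt are placeholders:)

```python
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/llama"  # placeholder checkpoint path
prompt = "The capital of France is"

# Greedy decoding with vLLM (temperature=0.0 selects the argmax token).
llm = LLM(model=MODEL)
result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=32))
vllm_ids = list(result[0].outputs[0].token_ids)

# Greedy decoding with HF transformers.
tok = AutoTokenizer.from_pretrained(MODEL)
hf = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(hf.device)
out = hf.generate(**inputs, do_sample=False, max_new_tokens=32)
hf_ids = out[0][inputs.input_ids.shape[1]:].tolist()

# With true greedy decoding these should match token-for-token.
mismatch = next((i for i, (a, b) in enumerate(zip(vllm_ids, hf_ids)) if a != b), None)
print("first mismatch at token index:", mismatch)
```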
> I compared through the sampling process and could not find a difference; if greedy doesn't match, could it be something in PagedAttention or the CUDA kernels?
See my issue here: #706. I set the same params, but the results are totally wrong; the model seems much dumber than the HF version... |
Encountered the same problem |
Yes, me too. |
Encountered the same problem; see my issue. |
Hi, can anyone please reproduce the answer from LLama2-7B-Chat with the prompt "hello"? In my case, I just get a weird answer: "@matthew-james.com". I used exactly the same code as @foamliu when using vLLM with LLama2-7B-Chat. Thank you for your time and help! |
Awesome!! |
Encountered the same problem when using a model with dynamic RoPE scaling ("rope_scaling": { ... in the model config). |
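(For reference, a dynamic-NTK RoPE block in a LLaMA config.json typically takes the following shape; the factor here is only illustrative, not the value from this thread:)

```python
# Equivalent of the config.json "rope_scaling" entry, shown as a Python dict:
rope_scaling = {"type": "dynamic", "factor": 2.0}
```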
The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well. Below is the reproduction code:
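(A minimal sketch of such a call with the vLLM offline API; the model path, tensor-parallel degree, and prompt are placeholders, not the original values:)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/llama-65b", tensor_parallel_size=8)  # TP degree is an assumption
params = SamplingParams(temperature=0.0, max_tokens=256)      # greedy decoding
outputs = llm.generate(["Give me a list of famous mathematicians."], params)
print(outputs[0].outputs[0].text)  # on 65B this reportedly prints only '\n'
```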
And HuggingFace transformers works normally:
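(And a corresponding HF transformers sketch, with the same placeholder path and prompt:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-65b")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-65b", torch_dtype=torch.float16, device_map="auto"
)
inputs = tokenizer("Give me a list of famous mathematicians.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)  # greedy decoding
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # produces sensible text
```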