Inference with LLaMA 65B generates nothing but \n #450

Open
foamliu opened this issue Jul 13, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@foamliu

foamliu commented Jul 13, 2023

The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well. Below is the code to reproduce it:

from vllm import LLM, SamplingParams
args_model = '/mnt/sdb/ly/models/hf_converted_llama/65B/'
llm = LLM(model=args_model, tokenizer=args_model, tokenizer_mode='slow', dtype='float16', seed=42, tensor_parallel_size=8)
sampling_params = SamplingParams(temperature=0, max_tokens=10)
prompt = 'The capital of France is'
outputs = llm.generate(prompts=[prompt], sampling_params=sampling_params)
>>> outputs
[RequestOutput(request_id=0, prompt='The capital of France is', prompt_token_ids=[0, 450, 7483, 310, 3444, 338], outputs=[CompletionOutput(index=0, text='\n\n\n\n\n\n\n\n\n\n', token_ids=[13, 13, 13, 13, 13, 13, 13, 13, 13, 13], cumulative_logprob=-34.18291640281677, logprobs={}, finish_reason=length)], finished=True)]
>>> sampling_params = SamplingParams(temperature=0.1, max_tokens=10)

And HuggingFace Transformers works normally:

import transformers                  
tokenizers = transformers.LlamaTokenizer.from_pretrained("/mnt/sdb/ly/models/hf_converted_llama/65B/")
model = transformers.LlamaForCausalLM.from_pretrained("/mnt/sdb/ly/models/hf_converted_llama/65B/", device_map="auto")  
prompt = 'The capital of France is'
inputs = tokenizers.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=10)
text = tokenizers.decode(outputs[0], skip_special_tokens=True)
>>> text
'The capital of France is Paris.\nThe capital of France is Paris.'
@Hukongtao

I had the same problem. The outputs from vLLM and HF are inconsistent.

@lucasjinreal

Does this happen only on the 65B model? I am using 7B normally.

@HermitSun
Contributor

Is there any difference between generation args of vllm and hf? It seems that vllm has some args that hf does not have.

@Hukongtao

Is there any difference between generation args of vllm and hf? It seems that vllm has some args that hf does not have.
generation args of vllm: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161

@WoosukKwon added the bug (Something isn't working) label on Jul 13, 2023
@foamliu
Author

foamliu commented Jul 14, 2023

Does this happen only on the 65B model? I am using 7B normally.

The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well.

@foamliu
Author

foamliu commented Jul 14, 2023

Is there any difference between generation args of vllm and hf? It seems that vllm has some args that hf does not have.
generation args of vllm: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161

The generation args are in the code above; both should use greedy decoding, so the results shouldn't differ much (yet vLLM returns nothing but '\n').
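
For reference, a minimal sketch of the greedy settings the two runs are assumed to be equivalent to; these objects only illustrate the mapping and are not taken from the scripts above:

from transformers import GenerationConfig
from vllm import SamplingParams

# Greedy decoding in vLLM: temperature=0 disables sampling entirely.
greedy_vllm = SamplingParams(temperature=0, max_tokens=10)

# Greedy decoding in HF Transformers: do_sample=False (the default) with num_beams=1.
greedy_hf = GenerationConfig(do_sample=False, num_beams=1, max_new_tokens=10)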

@foamliu
Author

foamliu commented Jul 14, 2023

I had the same problem. The outputs from vLLM and HF are inconsistent.

Yes, I found this problem too. With greedy decoding, LLaMA 7B/13B/30B produce meaningful output, but the results still differ from HF Transformers.

For example, the following are the scores of my evaluation with several benchmarks:

GSM8K

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    9.40        15.01        24.94
HF      10.46       14.86        30.40

MMLU

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    35.8        46.9         48.9
HF      34.1        46.7         57.8

@lucasjinreal

The generation params can heavily affect final model performance.

@MM-IR

MM-IR commented Jul 15, 2023

So is it reliable to evaluate LLaMA results using your scripts? That is really weird...

@foamliu
Author

foamliu commented Jul 17, 2023

So is it reliable to evaluate LLaMA results using your scripts? That is really weird...

The same result can be stably reproduced on my V100 server.

@andyfeih

vLLM just failed to load the weights; for example, vLLM has no support for safetensors yet.

@foamliu
Author

foamliu commented Jul 20, 2023

vLLM just failed to load the weights; for example, vLLM has no support for safetensors yet.

vLLM does not yet support safetensors, but this does not prevent us from converting the LLaMA model into the sharded PyTorch format (e.g. pytorch_model-00001-of-00003.bin) and then loading it with vLLM, as sketched below.
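
A rough sketch of one way to do that conversion (assuming the checkpoint loads in Transformers; the paths below are placeholders):

from transformers import LlamaForCausalLM, LlamaTokenizer

src = "/path/to/llama-65b-safetensors"  # placeholder source checkpoint
dst = "/path/to/llama-65b-bin"          # placeholder output directory

# Re-save the weights as sharded pytorch_model-*.bin files instead of safetensors.
model = LlamaForCausalLM.from_pretrained(src, torch_dtype="auto")
model.save_pretrained(dst, safe_serialization=False)

# Copy the tokenizer alongside the converted weights.
tokenizer = LlamaTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)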

@CtfGo

CtfGo commented Jul 21, 2023

I have also encountered the same problem: the same prompt does not produce the same output in both, even with greedy sampling params. Is anyone working on resolving this?

params        HF     vLLM
top_p         1.0    1.0
top_k         -1     -1
temperature   0.0    0.0

@syskn

syskn commented Jul 21, 2023

I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).

I compared the sampling process and could not find a difference. If greedy doesn't match, could it be something in PagedAttention or the CUDA kernels?
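
One possible way to narrow that down (a sketch, not something tried in this thread): compare the top next-token log-probabilities vLLM reports against the logits from a plain HF forward pass on the same prompt. If those already disagree, the divergence is in model execution rather than in sampling. The path and the top-10 count are illustrative:

import torch
import transformers
from vllm import LLM, SamplingParams

model_path = "/path/to/hf_converted_llama/65B"  # placeholder path
prompt = "The capital of France is"

# vLLM: request the top log-probs for the first generated token only.
llm = LLM(model=model_path, dtype="float16", tensor_parallel_size=8)
params = SamplingParams(temperature=0, max_tokens=1, logprobs=10)
out = llm.generate([prompt], params)
print(out[0].outputs[0].logprobs)  # top candidates for the first generated token

# HF: top-10 next-token logits for the same prompt.
# (Run this half separately if GPU memory is tight.)
tok = transformers.LlamaTokenizer.from_pretrained(model_path)
model = transformers.LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16
)
ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]
print(next_token_logits.topk(10))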

@lw921014

For LLaMA 65B, you'd better modify your tokenizer's BOS to 1 (which is 0 for LLaMA 13B).
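
A quick way to check this (a sketch; the path is a placeholder): load the converted 65B tokenizer and verify that bos_token_id is 1 and that an encoded prompt starts with 1 rather than 0, unlike the prompt_token_ids=[0, 450, ...] shown in the vLLM output above:

from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("/path/to/hf_converted_llama/65B")  # placeholder path
print(tok.bos_token_id)                        # LLaMA checkpoints normally use 1 here
print(tok.encode("The capital of France is"))  # should start with 1, not 0

# If the value is wrong, fix bos_token_id in the checkpoint's config.json
# (and the BOS entry in tokenizer_config.json), then reload.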

@luohao123

@syskn

I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).

I compared the sampling process and could not find a difference. If greedy doesn't match, could it be something in PagedAttention or the CUDA kernels?

See my issue here: #706

I set the same params, but the results are totally wrong; the bot seems stupider than the HF version...

@oushu1zhangxiangxuan1
Contributor

Encountered the same problem

@phamkhactu

Encountered the same problem

Yes, me too.

@will-wiki

will-wiki commented Dec 20, 2023

Encountered the same problem; see my issue.

@pvtoan

pvtoan commented Dec 21, 2023

Hi,

Could anyone please try to reproduce the answer from Llama2-7B-Chat with the prompt "hello"?

Because, in my case, I just get a weird answer: "@matthew-james.com".

I used exactly the same code as @foamliu when using vLLM with Llama2-7B-Chat.

Thank you for your time and help!

@ArlanCooper

I had the same problem. The outputs from vLLM and HF are inconsistent.

Yes, I found this problem too. With greedy decoding, LLaMA 7B/13B/30B produce meaningful output, but the results still differ from HF Transformers.

For example, the following are the scores of my evaluation with several benchmarks:

GSM8K

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    9.40        15.01        24.94
HF      10.46       14.86        30.40

MMLU

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    35.8        46.9         48.9
HF      34.1        46.7         57.8

awesome!!

@thehir0

thehir0 commented May 27, 2024

Encountered the same problem when using a model with dynamic RoPE scaling.

"rope_scaling": {
"factor": 8.0,
"type": "dynamic"
},
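
If useful, a small sketch (placeholder path) to confirm what rope_scaling settings the checkpoint actually declares, so the vLLM and HF runs can be checked against the same config:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("/path/to/model")  # placeholder path
print(cfg.rope_scaling)              # e.g. {"factor": 8.0, "type": "dynamic"}
print(cfg.max_position_embeddings)   # base context length the scaling applies to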
