Inference with LLaMA 65B generates nothing but \n #450

Open
foamliu opened this issue Jul 13, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@foamliu

foamliu commented Jul 13, 2023

The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well. Below is the code to reproduce it:

from vllm import LLM, SamplingParams
args_model = '/mnt/sdb/ly/models/hf_converted_llama/65B/'
llm = LLM(model=args_model, tokenizer=args_model, tokenizer_mode='slow', dtype='float16', seed=42, tensor_parallel_size=8)
sampling_params = SamplingParams(temperature=0, max_tokens=10)
prompt = 'The capital of France is'
outputs = llm.generate(prompts=[prompt], sampling_params=sampling_params)
>>> outputs
[RequestOutput(request_id=0, prompt='The capital of France is', prompt_token_ids=[0, 450, 7483, 310, 3444, 338], outputs=[CompletionOutput(index=0, text='\n\n\n\n\n\n\n\n\n\n', token_ids=[13, 13, 13, 13, 13, 13, 13, 13, 13, 13], cumulative_logprob=-34.18291640281677, logprobs={}, finish_reason=length)], finished=True)]
>>> sampling_params = SamplingParams(temperature=0.1, max_tokens=10)

And HuggingFace Transformers works normally:

import transformers                  
tokenizers = transformers.LlamaTokenizer.from_pretrained("/mnt/sdb/ly/models/hf_converted_llama/65B/")
model = transformers.LlamaForCausalLM.from_pretrained("/mnt/sdb/ly/models/hf_converted_llama/65B/", device_map="auto")  
prompt = 'The capital of France is'
inputs = tokenizers.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=10)
text = tokenizers.decode(outputs[0], skip_special_tokens=True)
>>> text
'The capital of France is Paris.\nThe capital of France is Paris.'
@Hukongtao

I had the same problem. The outputs from vLLM and HF are inconsistent.

@lucasjinreal

Does this happen only on the 65B model? I am using 7B normally.

@HermitSun
Contributor

Is there any difference between generation args of vllm and hf? It seems that vllm has some args that hf does not have.

@Hukongtao

Is there any difference between generation args of vllm and hf? It seems that vllm has some args that hf does not have.
generation args of vllm: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161

@WoosukKwon added the bug (Something isn't working) label on Jul 13, 2023
@foamliu
Author

foamliu commented Jul 14, 2023

Does this happen only on the 65B model? I am using 7B normally.

The problem only happens with LLaMA 65B; LLaMA 7B/13B/30B work well.

@foamliu
Author

foamliu commented Jul 14, 2023

Is there any difference between generation args of vllm and hf? It seems that vllm has some args that hf does not have.
generation args of vllm: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1161

The generation args are in the code above; both should use greedy decoding, so the results shouldn't differ much (yet vLLM returns nothing but '\n').
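
For reference, a minimal sketch of the greedy settings the two runs are assumed to be equivalent to; these objects only illustrate the mapping and are not taken from the scripts above:

from transformers import GenerationConfig
from vllm import SamplingParams

# Greedy decoding in vLLM: temperature=0 disables sampling entirely.
greedy_vllm = SamplingParams(temperature=0, max_tokens=10)

# Greedy decoding in HF Transformers: do_sample=False (the default) with num_beams=1.
greedy_hf = GenerationConfig(do_sample=False, num_beams=1, max_new_tokens=10)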

@foamliu
Author

foamliu commented Jul 14, 2023

I had the same problem. The outputs from vLLM and HF are inconsistent.

Yes, I found this problem too. With greedy decoding, LLaMA 7B/13B/30B produce meaningful output, but the results still differ from HF Transformers.

For example, the following are the scores of my evaluation with several benchmarks:

GSM8K

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    9.40        15.01        24.94
HF      10.46       14.86        30.40

MMLU

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    35.8        46.9         48.9
HF      34.1        46.7         57.8

@lucasjinreal

The generation params can heavily affect final model performance.

@MM-IR

MM-IR commented Jul 15, 2023

So is it reliable to evaluate LLaMA results using your scripts? That is really weird...

@foamliu
Author

foamliu commented Jul 17, 2023

So is it reliable to evaluate LLaMA results using your scripts? That is really weird...

The same result can be stably reproduced on my V100 server.

@andyfeih

vLLM just failed to load the weights; for example, vLLM has no support for safetensors yet.

@foamliu
Author

foamliu commented Jul 20, 2023

vLLM just failed to load the weights; for example, vLLM has no support for safetensors yet.

vLLM does not yet support safetensors, but this does not prevent us from converting the LLaMA model into the sharded PyTorch format (e.g. pytorch_model-00001-of-00003.bin) and then loading it with vLLM, as sketched below.
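
A rough sketch of one way to do that conversion (assuming the checkpoint loads in Transformers; the paths below are placeholders):

from transformers import LlamaForCausalLM, LlamaTokenizer

src = "/path/to/llama-65b-safetensors"  # placeholder source checkpoint
dst = "/path/to/llama-65b-bin"          # placeholder output directory

# Re-save the weights as sharded pytorch_model-*.bin files instead of safetensors.
model = LlamaForCausalLM.from_pretrained(src, torch_dtype="auto")
model.save_pretrained(dst, safe_serialization=False)

# Copy the tokenizer alongside the converted weights.
tokenizer = LlamaTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)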

@CtfGo

CtfGo commented Jul 21, 2023

I have also encountered the same problem: the same prompt does not produce the same output in both, even with greedy sampling params. Is anyone working on resolving this?

params        HF     vLLM
top_p         1.0    1.0
top_k         -1     -1
temperature   0.0    0.0

@syskn

syskn commented Jul 21, 2023

I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).

I compared the sampling process and could not find a difference. If greedy doesn't match, could it be something in PagedAttention or the CUDA kernels?
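
One possible way to narrow that down (a sketch, not something tried in this thread): compare the top next-token log-probabilities vLLM reports against the logits from a plain HF forward pass on the same prompt. If those already disagree, the divergence is in model execution rather than in sampling. The path and the top-10 count are illustrative:

import torch
import transformers
from vllm import LLM, SamplingParams

model_path = "/path/to/hf_converted_llama/65B"  # placeholder path
prompt = "The capital of France is"

# vLLM: request the top log-probs for the first generated token only.
llm = LLM(model=model_path, dtype="float16", tensor_parallel_size=8)
params = SamplingParams(temperature=0, max_tokens=1, logprobs=10)
out = llm.generate([prompt], params)
print(out[0].outputs[0].logprobs)  # top candidates for the first generated token

# HF: top-10 next-token logits for the same prompt.
# (Run this half separately if GPU memory is tight.)
tok = transformers.LlamaTokenizer.from_pretrained(model_path)
model = transformers.LlamaForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.float16
)
ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]
print(next_token_logits.topk(10))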

@lw921014

For LLaMA 65B, you'd better modify your tokenizer's BOS to 1 (which is 0 for LLaMA 13B).
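
A quick way to check this (a sketch; the path is a placeholder): load the converted 65B tokenizer and verify that bos_token_id is 1 and that an encoded prompt starts with 1 rather than 0, unlike the prompt_token_ids=[0, 450, ...] shown in the vLLM output above:

from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("/path/to/hf_converted_llama/65B")  # placeholder path
print(tok.bos_token_id)                        # LLaMA checkpoints normally use 1 here
print(tok.encode("The capital of France is"))  # should start with 1, not 0

# If the value is wrong, fix bos_token_id in the checkpoint's config.json
# (and the BOS entry in tokenizer_config.json), then reload.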

@luohao123

@syskn

I have been trying various models, and the outputs I get from vLLM are consistently and significantly more deterministic than the HF implementation (it tends to behave like greedy decoding and has severe repetition issues at temperatures below 0.7).

I compared the sampling process and could not find a difference. If greedy doesn't match, could it be something in PagedAttention or the CUDA kernels?

See my issue here: #706

I set the same params, but the results are totally wrong; the bot seems stupider than the HF version...

@oushu1zhangxiangxuan1
Contributor

Encountered the same problem

@phamkhactu

Encountered the same problem

Yes, me too.

@will-wiki

will-wiki commented Dec 20, 2023

Encountered the same problem; see my issue.

@pvtoan

pvtoan commented Dec 21, 2023

Hi,

Could anyone please try to reproduce the answer from Llama2-7B-Chat with the prompt "hello"?

Because, in my case, I just get a weird answer: "@matthew-james.com".

I used exactly the same code as @foamliu when using vLLM with Llama2-7B-Chat.

Thank you for your time and help!

@ArlanCooper

I had the same problem. The outputs from vLLM and HF are inconsistent.

Yes, I found this problem too. With greedy decoding, LLaMA 7B/13B/30B produce meaningful output, but the results still differ from HF Transformers.

For example, the following are the scores of my evaluation with several benchmarks:

GSM8K

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    9.40        15.01        24.94
HF      10.46       14.86        30.40

MMLU

        LLaMA 7B    LLaMA 13B    LLaMA 30B
vLLM    35.8        46.9         48.9
HF      34.1        46.7         57.8

awesome!!

@thehir0

thehir0 commented May 27, 2024

Encountered the same problem when using a model with dynamic RoPE scaling.

"rope_scaling": {
"factor": 8.0,
"type": "dynamic"
},
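
If useful, a small sketch (placeholder path) to confirm what rope_scaling settings the checkpoint actually declares, so the vLLM and HF runs can be checked against the same config:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("/path/to/model")  # placeholder path
print(cfg.rope_scaling)              # e.g. {"factor": 8.0, "type": "dynamic"}
print(cfg.max_position_embeddings)   # base context length the scaling applies to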
