
The output of vLLM is different from that of HF #2196

Open · will-wiki opened this issue Dec 19, 2023 · 7 comments
Comments

@will-wiki commented Dec 19, 2023

Thank you very much for your outstanding work! When trying the internlm model, I found that the features produced by vLLM's first forward pass differ from those produced by HF for the same input. I would like to ask why this is: is it caused by differences in the underlying implementation?
Here are some of the configurations for the experiment:

Environment

cuda==11.8
python==3.9
torch==2.1.0+cu118
xformers==0.0.22.post7+cu118
transformers==4.35.0
GPU: V100

The code used to test the results

vLLM code:
from vllm import LLM, SamplingParams
prompts = ['请介绍下爱因斯坦的生平。']
sampling_params = SamplingParams(
    temperature=0, top_p=1, max_tokens=128, repetition_penalty=1.1,
    use_beam_search=True, best_of=5)
llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)


Hugging Face code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("internlm/internlm-7b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
prompts = ['请介绍下爱因斯坦的生平。']
for prompt in prompts:
    inputs = tokenizer([prompt], return_tensors="pt")
    for k,v in inputs.items():
        inputs[k] = v.cuda()
    gen_kwargs = {"num_beams": 5, "max_length": 128, "top_p": 1, "temperature": 0., "do_sample": False, "repetition_penalty": 1.1}  # official settings
    output = model.generate(**inputs, **gen_kwargs)
    output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)

The inputs are guaranteed to be identical, but the hidden_state from the first forward pass is not:

Hugging Face framework
input_embeds (features before being fed to the model):
array([[-0.007782, -0.001129,  0.001808, ...,  0.001305, -0.001099,
        -0.001038],
       [-0.01233 , -0.02148 , -0.00812 , ..., -0.002289,  0.01782 ,
        -0.021   ],
       [ 0.003204,  0.009766,  0.004364, ..., -0.02527 ,  0.005524,
         0.01636 ],
       ...,
       [-0.007263,  0.003021,  0.01721 , ..., -0.06006 , -0.02747 ,
        -0.02856 ],
       [-0.00412 ,  0.01068 ,  0.006622, ...,  0.00705 ,  0.007538,
        -0.0232  ],
       [-0.0381  , -0.02625 ,  0.0065  , ...,  0.02722 ,  0.02759 ,
        -0.00787 ]], dtype=float16)
hidden_state (output of the first forward pass):
array([[-0.0571 ,  1.743  ,  0.521  , ..., -1.4795 , -5.82   , -0.3972 ],
       [-0.671  , -2.166  ,  1.967  , ...,  0.2404 , -1.173  , -0.0839 ],
       [-0.8433 , -5.168  , -0.03244, ...,  5.035  ,  2.578  , -0.507  ],
       ...,
       [-0.547  , -4.03   ,  2.383  , ...,  3.295  ,  0.3582 ,  0.737  ],
       [-1.602  , -4.344  ,  0.466  , ...,  4.594  ,  3.092  , -0.1273 ],
       [-1.817  , -5.45   ,  0.1937 , ...,  5.4    ,  3.84   , -0.3865 ]],
      dtype=float16)

vLLM framework
vLLM==0.2.2 (https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl#sha256=7a8b51f0565baaa820f8dc0376e1ff5a732fcabda26397a55becd90e07b5fc63)
input_embeds (features before being fed to the model):
array([[[-0.007782, -0.001129,  0.001808, ...,  0.001305, -0.001099,
         -0.001038],
        [-0.01233 , -0.02148 , -0.00812 , ..., -0.002289,  0.01782 ,
         -0.021   ],
        [ 0.003204,  0.009766,  0.004364, ..., -0.02527 ,  0.005524,
          0.01636 ],
        ...,
        [-0.007263,  0.003021,  0.01721 , ..., -0.06006 , -0.02747 ,
         -0.02856 ],
        [-0.00412 ,  0.01068 ,  0.006622, ...,  0.00705 ,  0.007538,
         -0.0232  ],
        [-0.0381  , -0.02625 ,  0.0065  , ...,  0.02722 ,  0.02759 ,
         -0.00787 ]]], dtype=float16)
hidden_state (output of the first forward pass):
array([[[-0.0643 ,  1.74   ,  0.5254 , ..., -1.48   , -5.82   ,
         -0.3816 ],
        [-0.674  , -2.17   ,  1.966  , ...,  0.2505 , -1.162  ,
         -0.0839 ],
        [-0.8413 , -5.168  , -0.03452, ...,  5.04   ,  2.582  ,
         -0.5073 ],
        ...,
        [-0.5483 , -4.035  ,  2.38   , ...,  3.295  ,  0.3564 ,
          0.7373 ],
        [-1.603  , -4.344  ,  0.466  , ...,  4.594  ,  3.092  ,
         -0.1282 ],
        [-1.816  , -5.45   ,  0.1952 , ...,  5.4    ,  3.836  ,
         -0.3877 ]]], dtype=float16)
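
For a quantitative comparison, here is a minimal sketch, assuming the two hidden-state tensors shown above have been dumped to disk as NumPy files (the file names hf_hidden.npy and vllm_hidden.npy are hypothetical):

import numpy as np

# Hypothetical dump files: the hidden states above would need to be saved with
# np.save() inside each framework's forward pass first.
hf_hidden = np.load("hf_hidden.npy").astype(np.float32).squeeze()
vllm_hidden = np.load("vllm_hidden.npy").astype(np.float32).squeeze()

# Element-wise drift: fp16 kernels in the two frameworks are not expected to
# be bit-identical, so look at the size of the difference rather than equality.
abs_diff = np.abs(hf_hidden - vllm_hidden)
print("max abs diff :", abs_diff.max())
print("mean abs diff:", abs_diff.mean())

# Per-position cosine similarity, to check the vectors still point the same way.
num = (hf_hidden * vllm_hidden).sum(axis=-1)
den = np.linalg.norm(hf_hidden, axis=-1) * np.linalg.norm(vllm_hidden, axis=-1)
print("min cosine similarity:", (num / den).min())
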
@Lvjinhong

I've also noticed this issue. When I tested Llama 2 7B with the same max tokens specified, the HF baseline produced a Total_num_tokens of 52455, whereas my vLLM run produced 44670 tokens.

@Lvjinhong

How can I set min_new_tokens to ensure generations of the same length?
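
On the Hugging Face side, generate() accepts min_new_tokens directly. Below is a minimal sketch, reusing the internlm setup from the issue; note that the vLLM releases discussed here (0.2.x) do not appear to expose a min_tokens option, so the closest equivalent there is ignore_eos=True, which forces generation to run all the way to max_tokens:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from vllm import LLM, SamplingParams

prompt = "请介绍下爱因斯坦的生平。"

# HF: min_new_tokens suppresses EOS until the minimum length is reached.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-7b", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
hf_out = model.generate(**inputs, do_sample=False,
                        max_new_tokens=128, min_new_tokens=128)

# vLLM 0.2.x: no min_tokens field in SamplingParams; ignore_eos=True makes
# every request generate exactly max_tokens tokens instead.
llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)
vllm_out = llm.generate([prompt],
                        SamplingParams(temperature=0, max_tokens=128, ignore_eos=True))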

@pvtoan commented Dec 21, 2023

Hi,

I also ran into this problem, but my case is even worse.

Actually, I used Llama2-7B-Chat-Hf with the following setup:

Config info:
vllm 0.2.6
transformers 4.36.2
LLM Llama2-7B-Chat-Hf
Python 3.10.12
Ubuntu 22.04
GPU: NVIDIA 4090 (24 GB)

Code:
import torch
from langchain.llms import VLLM  # assuming LangChain's VLLM wrapper

llm = VLLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1, trust_remote_code=True, temperature=0.6, top_k=5, top_p=0.9, torch_dtype=torch.bfloat16, max_new_tokens=500)
llm("hello")

Answer from llm("hello"):
@matthew-mitchell.com
www.matthew-mitchell.com
Matthew Mitchell is a composer ...

My questions
Also, I don't know why Llama2-7b-Chat occupied around 21 GB of VRAM.
(In fact, I ran exactly the same code before and it only took around 14 GB of VRAM.)

Does anyone else get weird answers and high GPU memory usage like me? If so, could you share how you fixed it?
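
On the memory question: vLLM pre-allocates a KV-cache pool sized by gpu_memory_utilization (default 0.9), so on a 24 GB card it will typically reserve around 21-22 GB regardless of the model's weight size. A minimal sketch with the plain vLLM API, using an illustrative lower cap:

from vllm import LLM

# gpu_memory_utilization bounds the fraction of GPU memory vLLM reserves for
# weights plus the pre-allocated KV cache (default 0.9, roughly 21.6 GB on a
# 24 GB card). Lowering it shrinks the reservation at the cost of fewer
# cached tokens.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    dtype="bfloat16",
    gpu_memory_utilization=0.7,  # illustrative value
)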

@Abigail61

Hi, I noticed you were able to print the hidden_state of the vLLM model's first forward pass. Is there any relevant code for this? Thank you very much.
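
One way to do this (a sketch only, relying on vLLM internals): attach a standard PyTorch forward hook to the decoder module that vLLM keeps on its worker. The attribute path llm.llm_engine.workers[0].model is what the 0.2.x line used internally and is not a public API; later versions moved the model elsewhere, and the hook placement assumes a Llama-style decoder stack under model.model.

from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)

# Internal, version-dependent attribute path (vLLM ~0.2.x); adjust as needed.
model = llm.llm_engine.workers[0].model

captured = {}

def grab_hidden_state(module, inputs, output):
    # The decoder stack's forward returns the final hidden states (pre-LM-head).
    captured["hidden_state"] = output

hook = model.model.register_forward_hook(grab_hidden_state)

# max_tokens=1 so only the first (prefill) forward pass runs.
llm.generate(["请介绍下爱因斯坦的生平。"],
             SamplingParams(temperature=0, max_tokens=1), use_tqdm=False)
hook.remove()

print(captured["hidden_state"])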

@ra-MANUJ-an

Hi @will-wiki, could you please answer @Abigail61?

@RRaphaell

Any updates on this?

@damandeep-hyprbots

I'm facing a similar issue with another model. Kindly provide an update on this.
