
The output of vLLM is different from that of HF #2196

Open · will-wiki opened this issue Dec 19, 2023 · 7 comments
Comments

@will-wiki commented Dec 19, 2023

Thank you very much for your outstanding work! When trying the internlm model, I found that the features produced by vLLM's first forward pass differ from those produced by HF for the same input. I would like to ask why this is: is it caused by differences in the underlying implementation?
Here are some of the configurations for the experiment:

Environment

cuda==11.8
python==3.9
torch==2.1.0+cu118
xformers==0.0.22.post7+cu118
transformers==4.35.0
GPU: V100

The code used to test the results

vLLM code:
from vllm import LLM, SamplingParams
prompts = ['请介绍下爱因斯坦的生平。']
sampling_params = SamplingParams(
    temperature=0, top_p=1, max_tokens=128, repetition_penalty=1.1,
    use_beam_search=True, best_of=5)
llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)


Hugging Face code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("internlm/internlm-7b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
prompts = ['请介绍下爱因斯坦的生平。']
for prompt in prompts:
    inputs = tokenizer([prompt], return_tensors="pt")
    for k,v in inputs.items():
        inputs[k] = v.cuda()
    gen_kwargs = {"num_beams": 5, "max_length": 128, "top_p": 1, "temperature": 0., "do_sample": False, "repetition_penalty": 1.1}  # official settings
    output = model.generate(**inputs, **gen_kwargs)
    output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)

The inputs are guaranteed to be identical, but the hidden_state from the first forward pass is not:

Hugging Face framework
input_embeds (features before being fed to the model):
array([[-0.007782, -0.001129,  0.001808, ...,  0.001305, -0.001099,
        -0.001038],
       [-0.01233 , -0.02148 , -0.00812 , ..., -0.002289,  0.01782 ,
        -0.021   ],
       [ 0.003204,  0.009766,  0.004364, ..., -0.02527 ,  0.005524,
         0.01636 ],
       ...,
       [-0.007263,  0.003021,  0.01721 , ..., -0.06006 , -0.02747 ,
        -0.02856 ],
       [-0.00412 ,  0.01068 ,  0.006622, ...,  0.00705 ,  0.007538,
        -0.0232  ],
       [-0.0381  , -0.02625 ,  0.0065  , ...,  0.02722 ,  0.02759 ,
        -0.00787 ]], dtype=float16)
hidden_state (output of the first forward pass):
array([[-0.0571 ,  1.743  ,  0.521  , ..., -1.4795 , -5.82   , -0.3972 ],
       [-0.671  , -2.166  ,  1.967  , ...,  0.2404 , -1.173  , -0.0839 ],
       [-0.8433 , -5.168  , -0.03244, ...,  5.035  ,  2.578  , -0.507  ],
       ...,
       [-0.547  , -4.03   ,  2.383  , ...,  3.295  ,  0.3582 ,  0.737  ],
       [-1.602  , -4.344  ,  0.466  , ...,  4.594  ,  3.092  , -0.1273 ],
       [-1.817  , -5.45   ,  0.1937 , ...,  5.4    ,  3.84   , -0.3865 ]],
      dtype=float16)

vLLM framework
vLLM==0.2.2 (https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl#sha256=7a8b51f0565baaa820f8dc0376e1ff5a732fcabda26397a55becd90e07b5fc63)
input_embeds (features before being fed to the model):
array([[[-0.007782, -0.001129,  0.001808, ...,  0.001305, -0.001099,
         -0.001038],
        [-0.01233 , -0.02148 , -0.00812 , ..., -0.002289,  0.01782 ,
         -0.021   ],
        [ 0.003204,  0.009766,  0.004364, ..., -0.02527 ,  0.005524,
          0.01636 ],
        ...,
        [-0.007263,  0.003021,  0.01721 , ..., -0.06006 , -0.02747 ,
         -0.02856 ],
        [-0.00412 ,  0.01068 ,  0.006622, ...,  0.00705 ,  0.007538,
         -0.0232  ],
        [-0.0381  , -0.02625 ,  0.0065  , ...,  0.02722 ,  0.02759 ,
         -0.00787 ]]], dtype=float16)
hidden_state (output of the first forward pass):
array([[[-0.0643 ,  1.74   ,  0.5254 , ..., -1.48   , -5.82   ,
         -0.3816 ],
        [-0.674  , -2.17   ,  1.966  , ...,  0.2505 , -1.162  ,
         -0.0839 ],
        [-0.8413 , -5.168  , -0.03452, ...,  5.04   ,  2.582  ,
         -0.5073 ],
        ...,
        [-0.5483 , -4.035  ,  2.38   , ...,  3.295  ,  0.3564 ,
          0.7373 ],
        [-1.603  , -4.344  ,  0.466  , ...,  4.594  ,  3.092  ,
         -0.1282 ],
        [-1.816  , -5.45   ,  0.1952 , ...,  5.4    ,  3.836  ,
         -0.3877 ]]], dtype=float16)
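
For a quantitative comparison, here is a minimal sketch, assuming the two hidden-state tensors shown above have been dumped to disk as NumPy files (the file names hf_hidden.npy and vllm_hidden.npy are hypothetical):

import numpy as np

# Hypothetical dump files: the hidden states above would need to be saved with
# np.save() inside each framework's forward pass first.
hf_hidden = np.load("hf_hidden.npy").astype(np.float32).squeeze()
vllm_hidden = np.load("vllm_hidden.npy").astype(np.float32).squeeze()

# Element-wise drift: fp16 kernels in the two frameworks are not expected to
# be bit-identical, so look at the size of the difference rather than equality.
abs_diff = np.abs(hf_hidden - vllm_hidden)
print("max abs diff :", abs_diff.max())
print("mean abs diff:", abs_diff.mean())

# Per-position cosine similarity, to check the vectors still point the same way.
num = (hf_hidden * vllm_hidden).sum(axis=-1)
den = np.linalg.norm(hf_hidden, axis=-1) * np.linalg.norm(vllm_hidden, axis=-1)
print("min cosine similarity:", (num / den).min())
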
@Lvjinhong

I've also noticed this issue. When I tested Llama 2 7B with the same max tokens specified, the HF baseline produced a Total_num_tokens of 52455, whereas my vLLM run produced 44670 tokens.

@Lvjinhong

How can I set min_new_tokens to ensure generations of the same length?
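
On the Hugging Face side, generate() accepts min_new_tokens directly. Below is a minimal sketch, reusing the internlm setup from the issue; note that the vLLM releases discussed here (0.2.x) do not appear to expose a min_tokens option, so the closest equivalent there is ignore_eos=True, which forces generation to run all the way to max_tokens:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from vllm import LLM, SamplingParams

prompt = "请介绍下爱因斯坦的生平。"

# HF: min_new_tokens suppresses EOS until the minimum length is reached.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-7b", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
hf_out = model.generate(**inputs, do_sample=False,
                        max_new_tokens=128, min_new_tokens=128)

# vLLM 0.2.x: no min_tokens field in SamplingParams; ignore_eos=True makes
# every request generate exactly max_tokens tokens instead.
llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)
vllm_out = llm.generate([prompt],
                        SamplingParams(temperature=0, max_tokens=128, ignore_eos=True))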

@pvtoan commented Dec 21, 2023

Hi,

I also ran into this problem, but my case is even worse.

Actually, I used Llama2-7B-Chat-Hf with the following setup:

Config info:
vllm 0.2.6
transformers 4.36.2
LLM Llama2-7B-Chat-Hf
Python 3.10.12
Ubuntu 22.04
GPU: NVIDIA 4090 (24 GB)

Code:
import torch
from langchain.llms import VLLM  # assuming LangChain's VLLM wrapper

llm = VLLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1, trust_remote_code=True, temperature=0.6, top_k=5, top_p=0.9, torch_dtype=torch.bfloat16, max_new_tokens=500)
llm("hello")

Answer from llm("hello"):
@matthew-mitchell.com
www.matthew-mitchell.com
Matthew Mitchell is a composer ...

My questions
Also, I don't know why Llama2-7b-Chat occupied around 21 GB of VRAM.
(In fact, I ran exactly the same code before and it only took around 14 GB of VRAM.)

Does anyone else get weird answers and high GPU memory usage like me? If so, could you share how you fixed it?
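
On the memory question: vLLM pre-allocates a KV-cache pool sized by gpu_memory_utilization (default 0.9), so on a 24 GB card it will typically reserve around 21-22 GB regardless of the model's weight size. A minimal sketch with the plain vLLM API, using an illustrative lower cap:

from vllm import LLM

# gpu_memory_utilization bounds the fraction of GPU memory vLLM reserves for
# weights plus the pre-allocated KV cache (default 0.9, roughly 21.6 GB on a
# 24 GB card). Lowering it shrinks the reservation at the cost of fewer
# cached tokens.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    dtype="bfloat16",
    gpu_memory_utilization=0.7,  # illustrative value
)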

@Abigail61

Hi, I noticed you were able to print the hidden_state of the vLLM model's first forward pass. Is there any relevant code for this? Thank you very much.
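
One way to do this (a sketch only, relying on vLLM internals): attach a standard PyTorch forward hook to the decoder module that vLLM keeps on its worker. The attribute path llm.llm_engine.workers[0].model is what the 0.2.x line used internally and is not a public API; later versions moved the model elsewhere, and the hook placement assumes a Llama-style decoder stack under model.model.

from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)

# Internal, version-dependent attribute path (vLLM ~0.2.x); adjust as needed.
model = llm.llm_engine.workers[0].model

captured = {}

def grab_hidden_state(module, inputs, output):
    # The decoder stack's forward returns the final hidden states (pre-LM-head).
    captured["hidden_state"] = output

hook = model.model.register_forward_hook(grab_hidden_state)

# max_tokens=1 so only the first (prefill) forward pass runs.
llm.generate(["请介绍下爱因斯坦的生平。"],
             SamplingParams(temperature=0, max_tokens=1), use_tqdm=False)
hook.remove()

print(captured["hidden_state"])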

@ra-MANUJ-an

Hi @will-wiki, could you please answer @Abigail61?

@RRaphaell

Any updates on this?

@damandeep-hyprbots

I'm facing a similar issue with another model. Kindly provide an update on this.
