
vllm reducing quality when loading local fine tuned Llama-2-13b-hf model #952

Closed
BerndHuber opened this issue Sep 5, 2023 · 7 comments
@BerndHuber

BerndHuber commented Sep 5, 2023

Has anyone else encountered the issue where a model loaded with vLLM generates low-quality/gibberish output when using a local, fine-tuned Llama-2-13b-hf model?

I am just using the standard inference method from the vLLM blog:

from vllm import LLM, SamplingParams

llm = LLM(model="/home/jupyter/tuned-Llama-2-13b-hf") --> this returns low quality
#llm = LLM(model="meta-llama/Llama-2-13b-hf") --> this works as expected
sampling_params = SamplingParams(top_p=0.9,temperature=0.9)

prompts = [
"### Instruction:\n\n### Input:\ntell me a joke.\n### Response:",
"### Instruction:\n\n### Input:\ncapital of france\n### Response:",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
  prompt = output.prompt
  generated_text = output.outputs[0].text
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Example output from the tuned model:

Prompt: '### Instruction:\n\n### Input:\ntell me a joke.\n### Response:', Generated text: 'Napoleon Doctor FALSEangel FIFALoc paradéklog;\r internally NGCзар quatre Compet probability'

Prompt: '### Instruction:\n\n### Input:\ncapital of france\n### Response:', Generated text: "wide]{'}$-poss следуdw */ achtpit AUT Ne premiersErrors уже ret math"

Example output from the base model:

Prompt: '### Instruction:\n\n### Input:\ntell me a joke.\n### Response:', Generated text: '\n😃 What did the tomato say to the other tomato'

Prompt: '### Instruction:\n\n### Input:\ncapital of france\n### Response:', Generated text: '\nParis\n'

The fine-tuned model works as expected when I load it with Hugging Face transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

output_dir = "/home/jupyter/tuned-Llama-2-13b-hf"
model = AutoModelForCausalLM.from_pretrained(output_dir, use_cache=False, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(output_dir)

def get_llama2_prediction(query):
    query = "### Instruction:\n\n### Input:\n" + query + "\n### Response:"
    input_ids = tokenizer(query, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=300, do_sample=True, top_p=0.9, temperature=0.9)
    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(query):]

get_llama2_prediction("tell me a joke")

Any ideas on how I can sample the same way with vLLM?
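
For reference, a minimal sketch of the closest vLLM equivalent I can think of for the transformers generate() call above (untested; max_tokens=300 is my assumption to mirror max_new_tokens=300, since vLLM's SamplingParams otherwise defaults to 16 tokens):

from vllm import LLM, SamplingParams

llm = LLM(model="/home/jupyter/tuned-Llama-2-13b-hf")
# temperature/top_p match the transformers call; max_tokens mirrors max_new_tokens=300
sampling_params = SamplingParams(temperature=0.9, top_p=0.9, max_tokens=300)

prompt = "### Instruction:\n\n### Input:\ntell me a joke.\n### Response:"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)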

@Hukongtao

+1

@Se-Hun

Se-Hun commented Sep 7, 2023

Has this been solved? I have the same problem.

These are my examples:

Prompt : Hello, my name is, Generated text : hd Stockрю Zumiska)}{Program laughおWrappernex Catalunyaacional establish定club
Prompt : The president of the United States is, Generated text : Business Mohści beskrevsὶnexкли mavenskieatted river totale округуDen;" Independent
Prompt : The capital of France is, Generated text : pleasurebólerves Pelід програ Referencias напreader encryption go савезној transactionárs displaysboBox
Prompt : The future of AI is, Generated text : canción vie DemocraticΤ}$. Foi nov "," beyondabiшње ' Stuartzetireq Phil

@Se-Hun

Se-Hun commented Sep 7, 2023

Issue #615 describes the same problem.

@Missmiaom

+1

@AntoninLeroy

+1. Mistral inference quality also seems low compared to the transformers library. Maybe a default-parameter issue?

@SuperBruceJia

+1 when loading the locally saved pre-trained model:

import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
model.save_pretrained(saver_dir)  # saver_dir is a local output directory
llm = LLM(saver_dir, tensor_parallel_size=1, gpu_memory_utilization=0.70)
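
One thing I would double-check (my assumption, not a confirmed cause): save_pretrained above only writes the model weights and config, so saver_dir contains no tokenizer files, while vLLM by default tries to load the tokenizer from the same path. A minimal sketch of also saving the tokenizer so the local directory is self-contained:

from transformers import AutoTokenizer

# Assumption: mirror the Hub layout by saving the tokenizer into the same directory.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained(saver_dir)  # writes tokenizer.model, tokenizer_config.json, special_tokens_map.json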

The model continuously generates the same text over and over. Please see the screenshot:

[screenshot: repeated model output]

Best regards,

Shuyue
Nov. 25th, 2023

@SuperBruceJia

My current solution is:

Instead of loading the locally saved pre-trained model, I load the model via:

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.70)

And merge my trained LoRA adapter:

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.70)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_saver_dir)

After testing, it seems to work well.

However, I am still using Hugging Face's AutoModelForCausalLM.from_pretrained, rather than vLLM, to load the original pre-trained model when fine-tuning the LoRA adapter. I will update this comment after conducting more experiments.
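
For what it's worth, an alternative I have not fully verified: merge the LoRA adapter offline with PEFT and point vLLM at the merged checkpoint, instead of patching the worker's model at runtime (the adapter and output directory names below are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

# Merge the LoRA weights into the base model and save a standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "./lora-adapter").merge_and_unload()
merged.save_pretrained("./llama-2-7b-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("./llama-2-7b-merged")

# Load the merged checkpoint directly in vLLM.
llm = LLM(model="./llama-2-7b-merged", tensor_parallel_size=1, gpu_memory_utilization=0.70)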

[screenshot omitted]

Best regards,

Shuyue
Nov. 25th, 2023

hmellor closed this as completed on Mar 25, 2024