
vllm reducing quality when loading local fine tuned Llama-2-13b-hf model #952

Closed
BerndHuber opened this issue Sep 5, 2023 · 7 comments
@BerndHuber

BerndHuber commented Sep 5, 2023

Has anyone else encountered the issue where a model loaded with vLLM generates low-quality/gibberish output when using a local, fine-tuned Llama-2-13b-hf model?

I am just using the standard inference method from the vLLM blog:

from vllm import LLM, SamplingParams

llm = LLM(model="/home/jupyter/tuned-Llama-2-13b-hf") --> this returns low quality
#llm = LLM(model="meta-llama/Llama-2-13b-hf") --> this works as expected
sampling_params = SamplingParams(top_p=0.9,temperature=0.9)

prompts = [
"### Instruction:\n\n### Input:\ntell me a joke.\n### Response:",
"### Instruction:\n\n### Input:\ncapital of france\n### Response:",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
  prompt = output.prompt
  generated_text = output.outputs[0].text
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Example output from the tuned model:

Prompt: '### Instruction:\n\n### Input:\ntell me a joke.\n### Response:', Generated text: 'Napoleon Doctor FALSEangel FIFALoc paradéklog;\r internally NGCзар quatre Compet probability'

Prompt: '### Instruction:\n\n### Input:\ncapital of france\n### Response:', Generated text: "wide]{'}$-poss следуdw */ achtpit AUT Ne premiersErrors уже ret math"

Example output from the base model:

Prompt: '### Instruction:\n\n### Input:\ntell me a joke.\n### Response:', Generated text: '\n😃 What did the tomato say to the other tomato'

Prompt: '### Instruction:\n\n### Input:\ncapital of france\n### Response:', Generated text: '\nParis\n'

The fine-tuned model works as expected when I load it with Hugging Face transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

output_dir = "/home/jupyter/tuned-Llama-2-13b-hf"
model = AutoModelForCausalLM.from_pretrained(output_dir, use_cache=False, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(output_dir)

def get_llama2_prediction(query):
    query = "### Instruction:\n\n### Input:\n" + query + "\n### Response:"
    input_ids = tokenizer(query, return_tensors="pt", truncation=True).input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=300, do_sample=True, top_p=0.9, temperature=0.9)
    return tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(query):]

get_llama2_prediction("tell me a joke")

Any ideas on how I can sample the same way with vLLM?
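
For reference, a minimal sketch of the closest vLLM equivalent I can think of for the transformers generate() call above (untested; max_tokens=300 is my assumption to mirror max_new_tokens=300, since vLLM's SamplingParams otherwise defaults to 16 tokens):

from vllm import LLM, SamplingParams

llm = LLM(model="/home/jupyter/tuned-Llama-2-13b-hf")
# temperature/top_p match the transformers call; max_tokens mirrors max_new_tokens=300
sampling_params = SamplingParams(temperature=0.9, top_p=0.9, max_tokens=300)

prompt = "### Instruction:\n\n### Input:\ntell me a joke.\n### Response:"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)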

@Hukongtao

+1

@Se-Hun

Se-Hun commented Sep 7, 2023

Has this been solved? I have the same problem.

These are my examples:

Prompt : Hello, my name is, Generated text : hd Stockрю Zumiska)}{Program laughおWrappernex Catalunyaacional establish定club
Prompt : The president of the United States is, Generated text : Business Mohści beskrevsὶnexкли mavenskieatted river totale округуDen;" Independent
Prompt : The capital of France is, Generated text : pleasurebólerves Pelід програ Referencias напreader encryption go савезној transactionárs displaysboBox
Prompt : The future of AI is, Generated text : canción vie DemocraticΤ}$. Foi nov "," beyondabiшње ' Stuartzetireq Phil

@Se-Hun

Se-Hun commented Sep 7, 2023

Issue #615 describes the same problem.

@Missmiaom

+1

@AntoninLeroy

+1. Mistral inference quality also seems low compared to the transformers library. Maybe a default-parameter issue?

@SuperBruceJia

+1 when loading the locally saved pre-trained model:

import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
model.save_pretrained(saver_dir)  # saver_dir is a local output directory
llm = LLM(saver_dir, tensor_parallel_size=1, gpu_memory_utilization=0.70)
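
One thing I would double-check (my assumption, not a confirmed cause): save_pretrained above only writes the model weights and config, so saver_dir contains no tokenizer files, while vLLM by default tries to load the tokenizer from the same path. A minimal sketch of also saving the tokenizer so the local directory is self-contained:

from transformers import AutoTokenizer

# Assumption: mirror the Hub layout by saving the tokenizer into the same directory.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained(saver_dir)  # writes tokenizer.model, tokenizer_config.json, special_tokens_map.json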

The model continuously generates the same text over and over. Please see the screenshot:

[screenshot: repeated model output]

Best regards,

Shuyue
Nov. 25th, 2023

@SuperBruceJia

My current solution is:

Instead of loading the locally saved pre-trained model, I load the model via:

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.70)

And merge my trained LoRA adapter:

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.70)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_saver_dir)

After testing, it seems to work well.

However, I am still using Hugging Face's AutoModelForCausalLM.from_pretrained, rather than vLLM, to load the original pre-trained model when fine-tuning the LoRA adapter. I will update this comment after conducting more experiments.
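
For what it's worth, an alternative I have not fully verified: merge the LoRA adapter offline with PEFT and point vLLM at the merged checkpoint, instead of patching the worker's model at runtime (the adapter and output directory names below are placeholders):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

# Merge the LoRA weights into the base model and save a standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "./lora-adapter").merge_and_unload()
merged.save_pretrained("./llama-2-7b-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("./llama-2-7b-merged")

# Load the merged checkpoint directly in vLLM.
llm = LLM(model="./llama-2-7b-merged", tensor_parallel_size=1, gpu_memory_utilization=0.70)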

[screenshot omitted]

Best regards,

Shuyue
Nov. 25th, 2023

hmellor closed this as completed on Mar 25, 2024