vllm reducing quality when loading local fine-tuned Llama-2-13b-hf model #952
Comments
+1
Is it solved? I have the same problem. Here are my examples.
Issue #615 has the same problem.
+1
+1. Mistral inference quality also seems low compared to the transformers library. Maybe a default parameter issue?
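If default parameters are the suspect, one way to check is to pin the sampling settings explicitly instead of relying on either library's defaults. A minimal sketch, assuming a Mistral checkpoint and placeholder sampling values (the model name, prompt, and parameter values below are illustrative, not defaults from either library):

```python
from vllm import LLM, SamplingParams

# Pin the sampling settings explicitly so they match whatever you use with transformers.
# Model name and parameter values are placeholders for illustration.
llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```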
+1 when loading a locally-saved pre-trained model:

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
model.save_pretrained(saver_dir)
llm = LLM(saver_dir, tensor_parallel_size=1, gpu_memory_utilization=0.70)

The model continuously generates the same stuff.

Best regards, Shuyue
My current solution: instead of loading the locally saved pre-trained model, I load the model via

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1, gpu_memory_utilization=0.70)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_saver_dir)

After testing, it seems to work well. However, since I am using Hugging Face

Best regards, Shuyue
Has anyone else encountered the issue that a model loaded with vllm generates low-quality/gibberish output when using a local, fine-tuned Llama-2-13b-hf model?
Just using the standard inference method from the vllm blog:
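(A minimal sketch of that standard vLLM generation flow, assuming a local checkpoint path and example sampling values; the path, prompt, and parameter values are placeholders, not the exact ones used here:)

```python
from vllm import LLM, SamplingParams

# Standard vLLM flow: load the checkpoint, set sampling params, generate.
# The path and sampling values are placeholders for illustration.
llm = LLM(model="/path/to/llama-2-13b-hf-finetuned", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```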
Example output (fine-tuned model):
Example output (base model):
The fine-tuned model works as expected when I load it with Hugging Face transformers:
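(A minimal sketch of the equivalent Hugging Face transformers generation, assuming the same local checkpoint path and placeholder sampling values:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path and sampling values; adjust to the actual fine-tuned checkpoint.
path = "/path/to/llama-2-13b-hf-finetuned"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, do_sample=True, temperature=0.8, top_p=0.95, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```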
Any ideas how I can sample the same way with vllm?