
[Core] Support LoRA on quantized models #4012

Merged: 6 commits into vllm-project:main on Apr 12, 2024

Conversation

jeejeelee (Contributor)

Building upon the excellent work done in #2828.

Since there hasn't been much progress on #2828, I'd like to continue and complete this feature.

Compared to #2828, the main improvement is the addition of support for tensor parallelism.

@fmmoret Please feel free to reach out if anything here seems off.
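
For readers unfamiliar with the feature being added, here is a minimal usage sketch combining a quantized base model with a LoRA adapter and tensor parallelism. The model path, adapter path, quantization method, and prompt are placeholders (not taken from this PR), so treat this as an illustration rather than the exact API surface of any particular vLLM release.

import vllm
from vllm import SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder paths and settings -- substitute a real quantized checkpoint
# (e.g. AWQ or GPTQ) and a LoRA adapter trained for that base model.
llm = vllm.LLM(
    model="path/to/quantized-base-model",
    quantization="awq",            # must match how the checkpoint was quantized
    enable_lora=True,
    max_loras=4,
    tensor_parallel_size=2,        # TP support is the main addition over #2828
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "path/to/lora-adapter"),
)
print(outputs[0].outputs[0].text)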

@Yard1 self-requested a review on Apr 11, 2024
@Yard1 self-assigned this on Apr 11, 2024
Review comment on vllm/config.py (outdated, resolved)
jeejeelee (Contributor, Author)

@Yard1 All checks have passed. Could I trouble you to please review it again?

Yard1 (Collaborator) left a comment

LGTM

@Yard1 changed the title from "[LoRA] quantized models support lora" to "[Core] Support LoRA on quantized models" on Apr 12, 2024
@Yard1 merged commit 1096717 into vllm-project:main on Apr 12, 2024
35 checks passed
fmmoret commented Apr 12, 2024

Very nice -- ty for closing out the last chores 🎉

Review thread on the following test code:

output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

del llm_tp1
cleanup()
Contributor

@Yard1 @jeejeelee I'm not able to release GPU memory using cleanup() or del llm_tp1. Are you sure this would work (if CI had multiple GPUs and we could enable this test)?
On version 0.4.2, once I start a vllm.LLM I can't find any way to release the GPU memory again; I have to kill the process. Do you know any other way?

jeejeelee (Contributor, Author)

My memory is a bit fuzzy, but I'm quite sure test_baichuan.py can be run with tp=2; I've tested it myself.

jeejeelee (Contributor, Author)

I have tried it with the following code, and the GPU memory is released normally. However, I didn't use the released 0.4.2 version; I used f12c3b5 instead.

@pytest.mark.parametrize("model", MODELS)
# @pytest.mark.skip("Requires multiple GPUs")
def test_quant_model_tp_equality(tinyllama_lora_files, model):
    # Cannot use as it will initialize torch.cuda too early...
    # if torch.cuda.device_count() < 2:
    #     pytest.skip(f"Not enough GPUs for tensor parallelism {2}")

    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

    del llm_tp1
    cleanup()
    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)
    
    del llm_tp1
    cleanup()
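
For context, the cleanup() helper called above comes from the vLLM test utilities and its body is not shown in this thread. A minimal sketch of what such a helper typically does (tearing down parallel state and freeing cached CUDA memory once the engine object has been deleted) might look like the following; the destroy_model_parallel import path is an assumption and has moved between vLLM releases.

import contextlib
import gc

import torch


def cleanup():
    # Tear down model-parallel state if it was initialized in this process.
    # NOTE: the import path below is an assumption; adjust it to match the
    # vLLM version you are running.
    with contextlib.suppress(Exception):
        from vllm.distributed import destroy_model_parallel
        destroy_model_parallel()
    # Once the last Python reference to the engine (e.g. `del llm_tp1` above)
    # is gone, collect it and release cached GPU allocations.
    gc.collect()
    torch.cuda.empty_cache()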

jeejeelee (Contributor, Author)

I don't have any multi-GPU machines available right now; I will test with tp=2 again later.

Contributor

Great, thanks for checking that. I believe tensor_parallel_size=2 uses multiprocessing, so the behavior is different from what I experimented with. I'll open a new issue.
