
[Core] Support LoRA on quantized models #4012

Merged: 6 commits into vllm-project:main on Apr 12, 2024

Conversation

jeejeelee (Contributor)

Building upon the excellent work done in #2828.

Since there hasn't been much progress on #2828, I'd like to continue and complete this feature.

Compared to #2828, the main improvement is the addition of support for tensor parallelism.

@fmmoret Please feel free to reach out if anything here seems off.
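
For readers unfamiliar with the feature being added, here is a minimal usage sketch combining a quantized base model with a LoRA adapter and tensor parallelism. The model path, adapter path, quantization method, and prompt are placeholders (not taken from this PR), so treat this as an illustration rather than the exact API surface of any particular vLLM release.

import vllm
from vllm import SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder paths and settings -- substitute a real quantized checkpoint
# (e.g. AWQ or GPTQ) and a LoRA adapter trained for that base model.
llm = vllm.LLM(
    model="path/to/quantized-base-model",
    quantization="awq",            # must match how the checkpoint was quantized
    enable_lora=True,
    max_loras=4,
    tensor_parallel_size=2,        # TP support is the main addition over #2828
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "path/to/lora-adapter"),
)
print(outputs[0].outputs[0].text)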

@Yard1 self-requested a review on Apr 11, 2024
@Yard1 self-assigned this on Apr 11, 2024
Review comment on vllm/config.py (outdated, resolved)
jeejeelee (Contributor, Author)

@Yard1 All checks have passed. Could I trouble you to please review it again?

Yard1 (Collaborator) left a comment

LGTM

@Yard1 changed the title from "[LoRA] quantized models support lora" to "[Core] Support LoRA on quantized models" on Apr 12, 2024
@Yard1 merged commit 1096717 into vllm-project:main on Apr 12, 2024
35 checks passed
fmmoret commented Apr 12, 2024

Very nice -- ty for closing out the last chores 🎉

Review thread on the following test code:

output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

del llm_tp1
cleanup()
Contributor

@Yard1 @jeejeelee I'm not able to release GPU memory using cleanup() or del llm_tp1. Are you sure this would work (if CI had multiple GPUs and we could enable this test)?
On version 0.4.2, once I start a vllm.LLM I can't find any way to release the GPU memory again; I have to kill the process. Do you know any other way?

jeejeelee (Contributor, Author)

My memory is a bit fuzzy, but I'm quite sure test_baichuan.py can be run with tp=2; I've tested it myself.

jeejeelee (Contributor, Author)

I have tried it with the following code, and the GPU memory is released normally. However, I didn't use the released 0.4.2 version; I used f12c3b5 instead.

@pytest.mark.parametrize("model", MODELS)
# @pytest.mark.skip("Requires multiple GPUs")
def test_quant_model_tp_equality(tinyllama_lora_files, model):
    # Cannot use as it will initialize torch.cuda too early...
    # if torch.cuda.device_count() < 2:
    #     pytest.skip(f"Not enough GPUs for tensor parallelism {2}")

    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

    del llm_tp1
    cleanup()
    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)
    
    del llm_tp1
    cleanup()
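
For context, the cleanup() helper called above comes from the vLLM test utilities and its body is not shown in this thread. A minimal sketch of what such a helper typically does (tearing down parallel state and freeing cached CUDA memory once the engine object has been deleted) might look like the following; the destroy_model_parallel import path is an assumption and has moved between vLLM releases.

import contextlib
import gc

import torch


def cleanup():
    # Tear down model-parallel state if it was initialized in this process.
    # NOTE: the import path below is an assumption; adjust it to match the
    # vLLM version you are running.
    with contextlib.suppress(Exception):
        from vllm.distributed import destroy_model_parallel
        destroy_model_parallel()
    # Once the last Python reference to the engine (e.g. `del llm_tp1` above)
    # is gone, collect it and release cached GPU allocations.
    gc.collect()
    torch.cuda.empty_cache()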

jeejeelee (Contributor, Author)

I don't have any multi-GPU machines available right now; I will test with tp=2 again later.

Contributor

Great, thanks for checking that. I believe tensor_parallel_size=2 uses multiprocessing, so the behavior is different from what I experimented with. I'll open a new issue.
