Will vLLM LoRA support SGMV for handling multi-LoRA requests? #2893

chenqianfzh · 2024-02-16T18:01:32Z

For its multi-LoRA feature, vLLM uses the BGMV kernel from punica.

However, in the punica project (https://github.com/punica-ai/punica), the authors say that SGMV (Segmented Gather Matrix-Vector Multiplication) is more flexible. Is there a plan in the community to support SGMV?

Darinochka · 2024-02-21T08:30:42Z

Your question is answered in #1804:

We are using BGMV kernels instead of new SGMV kernels from punica. The BGMV kernel is not efficient for prefill, but the current SGMV CUTLASS-based kernel is not configurable enough and suffers from accuracy drops due to the intermediate output being stored in half-precision. Once punica updates with custom, non-CUTLASS SGMV kernels, I will update the code to make use of them.
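
For anyone skimming, here is a rough NumPy sketch of what the two operations compute and why the half-precision intermediate matters. It is not the punica or vLLM kernel code. The function names (`bgmv`, `sgmv`), the argument layout, and the tensor shapes are all assumptions made up for illustration. BGMV looks up one adapter per token and does a matrix-vector product; SGMV groups consecutive tokens that share an adapter into a segment and does one matrix-matrix product per segment, which is the shape of work prefill produces.

```python
import numpy as np

def bgmv(x, weights, indices):
    """Reference semantics: one adapter lookup and one mat-vec per token."""
    y = np.zeros((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for i, adapter in enumerate(indices):
        y[i] = x[i] @ weights[adapter]
    return y

def sgmv(x, weights, seg_starts, seg_adapters):
    """Reference semantics: one mat-mat product per segment of tokens
    that share the same adapter (hypothetical argument layout)."""
    y = np.zeros((x.shape[0], weights.shape[2]), dtype=x.dtype)
    bounds = list(seg_starts) + [x.shape[0]]
    for s, adapter in enumerate(seg_adapters):
        lo, hi = bounds[s], bounds[s + 1]
        y[lo:hi] = x[lo:hi] @ weights[adapter]
    return y

rng = np.random.default_rng(0)
num_adapters, hidden, rank = 3, 64, 8  # illustrative sizes only

# LoRA "shrink" matrices, one per adapter: hidden -> rank
weights = rng.standard_normal((num_adapters, hidden, rank)).astype(np.float32)
x = rng.standard_normal((6, hidden)).astype(np.float32)

# Six tokens: three use adapter 0, two use adapter 2, one uses adapter 1.
indices = [0, 0, 0, 2, 2, 1]   # per-token view (BGMV)
seg_starts = [0, 3, 5]         # per-segment view (SGMV)
seg_adapters = [0, 2, 1]

y_bgmv = bgmv(x, weights, indices)
y_sgmv = sgmv(x, weights, seg_starts, seg_adapters)
same = np.allclose(y_bgmv, y_sgmv, atol=1e-4)

# Effect of rounding the rank-sized intermediate to fp16 before the
# "expand" step. This only models the rounding, not the real kernel.
b = rng.standard_normal((rank, hidden)).astype(np.float32)
exact = y_sgmv.astype(np.float64) @ b.astype(np.float64)
err_fp32 = np.abs(y_sgmv @ b - exact).max()
err_fp16 = np.abs(y_sgmv.astype(np.float16).astype(np.float32) @ b - exact).max()

print("BGMV and SGMV agree:", same)
print("max abs error, fp32 intermediate:", err_fp32)
print("max abs error, fp16 intermediate:", err_fp16)
```

Both functions return the same result; the difference is only in how the work is grouped. With long prompts most tokens in a request share one adapter, so the segmented form can use a few large matrix products instead of many small per-token ones.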

chenqianfzh · 2024-02-27T19:11:23Z

Thanks for the reply!

chenqianfzh closed this as completed Feb 27, 2024

chenqianfzh mentioned this issue Feb 27, 2024

[Performance] 40% performance drop using lora vs no lora #2829

Closed
