LoRA support #5968
Comments
Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further. DLIS-5053
@kthui Thank you. It would be great if you could send us a rough time estimate for completing it soon.
cc @Christina-Young-NVIDIA for the time estimate.
@tanmayv25 I think there was a similar request in the past to support TRT's refit API, but we ultimately went with just reloading the model, right? I think the same applies here: the model should just be reloaded, and there are APIs for that.
The TRT refit API is not suitable for production-level serving systems. Making a TensorRT engine refittable comes with a performance cost: the engine cannot be optimized to the same degree when built in refittable mode, which means higher inference latency. Reloading the model with new weights is the better-suited solution for serving. There is no service downtime, since requests can still run on the previous model until the new model with updated weights is ready, and the resulting engine is fully optimized.
So the best way is to merge the LoRA weights, convert the merged model to TRT, and then serve it on Triton in the usual way?
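The reload path described above can be driven through Triton's model-repository HTTP API. A minimal sketch, assuming the server runs in explicit model-control mode at `localhost:8000` and the model name `unet_lora` (both illustrative, not from this thread):

```python
# Sketch: reloading a Triton model after its weights change on disk, via the
# model-repository HTTP API (server must run with --model-control-mode=explicit).
# The URL and model name are assumptions for illustration.
import urllib.request

TRITON_URL = "http://localhost:8000"

def load_url(model_name: str) -> str:
    # Triton's repository API endpoint: POST /v2/repository/models/<name>/load
    return f"{TRITON_URL}/v2/repository/models/{model_name}/load"

def reload_model(model_name: str) -> None:
    # (Re)loads the model; in-flight requests keep using the old version
    # until the new one is ready, so there is no service downtime.
    req = urllib.request.Request(load_url(model_name), method="POST")
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 200

# Example (requires a running Triton server):
# reload_model("unet_lora")
```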
Is your feature request related to a problem? Please describe.
This is a feature related to how to deploy a model with LoRA supported.
Describe the solution you'd like
I have a UNet model deployed on Triton Inference Server with the TensorRT backend. I also have tens of LoRA weights (in Torch format) to be applied to the UNet model.
For each LoRA weight, I manually clone a UNet model (in Torch format) and merge the LoRA weight into the UNet using the formula:
W_new = W_unet + BA
where A and B are the weight matrices in the given LoRA weight. Then I manually convert the merged model to TensorRT and serve it on Triton Inference Server.
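The merge step above can be sketched as follows, using NumPy stand-ins for the Torch tensors (shapes and the `alpha` scaling factor are illustrative, not from this issue):

```python
# Sketch of the LoRA weight-merge step: W_new = W_unet + alpha * (B @ A).
# NumPy is used as a stand-in for Torch; shapes are illustrative.
import numpy as np

def merge_lora(W_unet, A, B, alpha=1.0):
    """Fold a LoRA update into a base weight matrix."""
    return W_unet + alpha * (B @ A)

# Example: a (d_out x d_in) base weight with rank-r LoRA factors.
d_out, d_in, r = 8, 16, 2
W = np.zeros((d_out, d_in))
A = np.ones((r, d_in))   # LoRA "down" projection, shape (r, d_in)
B = np.ones((d_out, r))  # LoRA "up" projection, shape (d_out, r)

W_new = merge_lora(W, A, B)
print(W_new.shape)  # (8, 16)
print(W_new[0, 0])  # 2.0, since (B @ A)[i, j] sums over rank r = 2
```

Once merged, W_new replaces the original weight and the model carries no extra inference-time cost, which is why it can then be converted to TensorRT as usual.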
With Triton, I would like to have an API/function to dynamically modify intrinsic values of running models, in this case UNet.
Additional context
I think this is a well-explained article for getting to know LoRA; hope it helps: https://lightning.ai/pages/community/tutorial/lora-llm/