
LoRA support #5968

Open · TaQuangTu opened this issue Jun 21, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

TaQuangTu commented Jun 21, 2023

Is your feature request related to a problem? Please describe.
This is a feature request related to deploying a model with LoRA support.

Describe the solution you'd like
I have a UNet model deployed on Triton Inference Server with the TensorRT backend. I also have tens of LoRA weights (in Torch format) to be applied to the UNet model.

For each LoRA weight, I manually clone a UNet model (in Torch format) and merge the LoRA weight into the UNet with the formula:
W_new = W_unet + B·A, where A and B are the low-rank weight matrices of the given LoRA adapter.

Then I manually convert the merged model to TensorRT and serve it on Triton Inference Server. The merge step looks roughly like the sketch below.
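For illustration, here is a minimal PyTorch sketch of that merge step, assuming the LoRA checkpoint maps each target module name to a (B, A) pair of low-rank matrices (the names and layout are hypothetical, not a fixed LoRA file format):

```python
import torch

def merge_lora_into_unet(unet, lora_state, alpha=1.0):
    """Fold LoRA deltas into the base UNet weights in place.

    lora_state is assumed to map a module name to a (B, A) pair
    of low-rank matrices such that the weight delta is B @ A.
    """
    sd = unet.state_dict()
    for name, (B, A) in lora_state.items():
        key = f"{name}.weight"
        # W_new = W_unet + alpha * (B @ A)
        sd[key] = sd[key] + alpha * (B @ A)
    unet.load_state_dict(sd)
    return unet
```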

With Triton, I would like an API/function to dynamically modify the weights of a running model, in this case the UNet.

Additional context
I think this is a well-explained article for getting to know LoRA; I hope it helps: https://lightning.ai/pages/community/tutorial/lora-llm/

kthui (Contributor) commented Jun 21, 2023

Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further: DLIS-5053

kthui added the enhancement (New feature or request) label on Jun 21, 2023
TaQuangTu (Author) commented:

@kthui Thank you. It would be great if you could share a rough time estimate for completing it.

kthui (Contributor) commented Jun 23, 2023

cc @Christina-Young-NVIDIA for the time estimate.

rmccorm4 (Collaborator) commented:

@tanmayv25 I think there was a similar request in the past to support TRT's refit API, but we ultimately went with just reloading the model, right?

I think the same applies here: the model should just be reloaded, and there are APIs for that.
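For reference, the reload can be driven from a client through Triton's model repository API; a minimal sketch with the Python HTTP client, assuming the server runs with `--model-control-mode=explicit` (the model name is illustrative):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# After copying the newly built engine into the model repository
# (e.g. models/unet/2/model.plan), ask Triton to (re)load it.
# In-flight requests keep running on the old version until the
# new one is ready to serve.
client.load_model("unet")
```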

tanmayv25 (Contributor) commented Jul 10, 2023

The TRT Refit API is not suitable for production-level serving systems. Making a TensorRT engine refittable comes with a performance cost: the engine cannot be optimized to the same degree when built in refittable mode, which means higher inference latency. Additionally, the engine cannot be used to run inference while its weights are being updated, which can lead to large tail latency.

Reloading the model with new weights is a better-suited solution for serving. There is no service downtime, as requests can still run on the previous model until the new model with updated weights is ready. And the resulting engine is highly optimized.
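For context, a rough sketch of what the refit path looks like with the TensorRT Python API; the build-time flag below is what restricts optimization (the weight name is illustrative, and network/engine construction is elided):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# Opting in to refit limits the tactics/fusions TensorRT may pick,
# which is the performance cost described above.
config.set_flag(trt.BuilderFlag.REFIT)
# ... define the network and build/serialize the engine as usual ...

# Later, to swap weights into a deserialized refittable engine:
# refitter = trt.Refitter(engine, logger)
# refitter.set_named_weights("conv_in.weight", new_weights)
# refitter.refit_cuda_engine()
```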

foocker commented Sep 18, 2023

> The TRT Refit API is not suitable for production-level serving systems. Making a TensorRT engine refittable comes with a performance cost: the engine cannot be optimized to the same degree when built in refittable mode, which means higher inference latency. Additionally, the engine cannot be used to run inference while its weights are being updated, which can lead to large tail latency.
>
> Reloading the model with new weights is a better-suited solution for serving. There is no service downtime, as requests can still run on the previous model until the new model with updated weights is ready. And the resulting engine is highly optimized.

So the best way is to convert the merged weights to a TensorRT engine and then serve it on Triton in the usual way?
