LoRA support #5968
Comments
Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further. DLIS-5053
@kthui Thank you. It would be great if you could send us a rough time estimate for completing it soon.
cc @Christina-Young-NVIDIA for the time estimate.
@tanmayv25 I think there was a similar request in the past to support TRT's refit API, but we ultimately went with just reloading the model, right? I think the same applies here: the model should just be reloaded, and there are APIs for that.
The TRT refit API is not suitable for production-level serving systems. Making a TensorRT engine refittable comes with a performance cost: the engine cannot be optimized to the same degree when built in refittable mode, which means higher inference latency. Reloading the model with new weights is the better-suited solution for serving. There is no service downtime, since requests can still run on the previous model until the new model with updated weights is ready, and the resulting engine is fully optimized.
So the best way is to merge the LoRA weights, convert the merged model to TRT, and then serve it on Triton in the usual way?
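The reload path described above can be driven through Triton's model-repository HTTP API. A minimal sketch, assuming the server runs in explicit model-control mode at `localhost:8000` and the model name `unet_lora` (both illustrative, not from this thread):

```python
# Sketch: reloading a Triton model after its weights change on disk, via the
# model-repository HTTP API (server must run with --model-control-mode=explicit).
# The URL and model name are assumptions for illustration.
import urllib.request

TRITON_URL = "http://localhost:8000"

def load_url(model_name: str) -> str:
    # Triton's repository API endpoint: POST /v2/repository/models/<name>/load
    return f"{TRITON_URL}/v2/repository/models/{model_name}/load"

def reload_model(model_name: str) -> None:
    # (Re)loads the model; in-flight requests keep using the old version
    # until the new one is ready, so there is no service downtime.
    req = urllib.request.Request(load_url(model_name), method="POST")
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 200

# Example (requires a running Triton server):
# reload_model("unet_lora")
```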
Is your feature request related to a problem? Please describe.
This is a feature related to how to deploy a model with LoRA supported.
Describe the solution you'd like
I have a UNet model deployed on Triton Inference Server with the TensorRT backend. I also have tens of LoRA weights (in Torch format) to be applied to the UNet model.
For each LoRA weight, I manually clone a UNet model (in Torch format) and merge the LoRA weight into the UNet using the formula:
W_new = W_unet + BA
where A and B are the weight matrices in the given LoRA weight. Then I manually convert the merged model to TensorRT and serve it on Triton Inference Server.
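The merge step above can be sketched as follows, using NumPy stand-ins for the Torch tensors (shapes and the `alpha` scaling factor are illustrative, not from this issue):

```python
# Sketch of the LoRA weight-merge step: W_new = W_unet + alpha * (B @ A).
# NumPy is used as a stand-in for Torch; shapes are illustrative.
import numpy as np

def merge_lora(W_unet, A, B, alpha=1.0):
    """Fold a LoRA update into a base weight matrix."""
    return W_unet + alpha * (B @ A)

# Example: a (d_out x d_in) base weight with rank-r LoRA factors.
d_out, d_in, r = 8, 16, 2
W = np.zeros((d_out, d_in))
A = np.ones((r, d_in))   # LoRA "down" projection, shape (r, d_in)
B = np.ones((d_out, r))  # LoRA "up" projection, shape (d_out, r)

W_new = merge_lora(W, A, B)
print(W_new.shape)  # (8, 16)
print(W_new[0, 0])  # 2.0, since (B @ A)[i, j] sums over rank r = 2
```

Once merged, W_new replaces the original weight and the model carries no extra inference-time cost, which is why it can then be converted to TensorRT as usual.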
With Triton, I would like to have an API/function to dynamically modify intrinsic values of running models, in this case UNet.
Additional context
I think this is a well-explained article for getting to know LoRA; hope it helps: https://lightning.ai/pages/community/tutorial/lora-llm/