Your current environment
N/A
Model Input Dumps
No response
🐛 Describe the bug
Using `torch.compile` with LoRA will fail, because vLLM's support for multi-LoRA (the punica kernels) is very complicated.
The punica wrapper (`class PunicaWrapper:`, defined at line 179 as of commit 571841b) is very similar to the attention ops. If we want to support `torch.compile` for it, we need to do something similar to #10558, i.e. hide the whole punica operation from `torch.compile`.
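As a rough illustration of what "hiding" an op means, here is a minimal sketch using `torch.library.custom_op`; the op name and signature are hypothetical, not vLLM's actual punica API:

```python
import torch

# Hypothetical op name and signature, for illustration only; the real
# punica kernels live in vLLM's CUDA extensions and look different.
@torch.library.custom_op("mylib::lora_shrink", mutates_args=())
def lora_shrink(x: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    # torch.compile treats a custom op as opaque: Dynamo records a
    # single call to it instead of tracing through the kernel internals.
    return x @ lora_a

@lora_shrink.register_fake
def _(x: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    # Fake (meta) implementation: shape/dtype propagation only, so the
    # compiler can build a graph around the op without running it.
    return x.new_empty(x.shape[0], lora_a.shape[1])
```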
The difference is that the attention ops have a fairly uniform signature, so we only need to register them once, while the punica ops have several signatures and are applied to various layers, including linear, embedding, etc. Even if we wrapped all of the ops into PyTorch custom ops for `torch.compile`, there would not be much left for `torch.compile` to accelerate.
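To make the signature problem concrete, here is a second hypothetical op: an embedding path would consume integer token indices rather than hidden states, so it could not share a registration with the linear-path op sketched above:

```python
import torch

@torch.library.custom_op("mylib::lora_embedding", mutates_args=())
def lora_embedding(indices: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    # Gathers LoRA rows by token id; the signature differs from the
    # linear variant, so it needs its own registration and fake impl.
    return lora_a[indices]

@lora_embedding.register_fake
def _(indices: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    return lora_a.new_empty(*indices.shape, lora_a.shape[1])
```

Every such variant multiplies the registration work, which is why wrapping everything is not very attractive.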
Therefore, I tend to leave LoRA as-is, and just ignore `torch.compile` for it.
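One way to "ignore" it, as a sketch (the module below is hypothetical, not vLLM's actual LoRA layer), is to mark the LoRA path with `torch.compiler.disable` so Dynamo graph-breaks around it and runs it eagerly, while the rest of the model still compiles:

```python
import torch

class LoRALinear(torch.nn.Module):
    """Hypothetical LoRA-augmented linear layer, for illustration."""

    def __init__(self, base: torch.nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.lora_a = torch.nn.Parameter(torch.zeros(base.in_features, rank))
        self.lora_b = torch.nn.Parameter(torch.zeros(rank, base.out_features))

    @torch.compiler.disable
    def apply_lora(self, x: torch.Tensor) -> torch.Tensor:
        # Dynamo skips this frame: the LoRA math always runs eagerly,
        # even when the surrounding model is compiled.
        return (x @ self.lora_a) @ self.lora_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.apply_lora(x)
```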
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.