Your current environment
N/A
Model Input Dumps
No response
🐛 Describe the bug
Using `torch.compile` with LoRA will fail, because vLLM's support for multi-LoRA (the punica kernels) is very complicated.
The punica wrapper (`class PunicaWrapper:`, defined at line 179 as of commit 571841b) is very similar to the attention ops. If we want to support `torch.compile` for it, we need to do something similar to #10558, i.e. hide the whole punica operation from `torch.compile`.
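As a rough illustration of what "hiding" an op means, here is a minimal sketch using `torch.library.custom_op`; the op name and signature are hypothetical, not vLLM's actual punica API:

```python
import torch

# Hypothetical op name and signature, for illustration only; the real
# punica kernels live in vLLM's CUDA extensions and look different.
@torch.library.custom_op("mylib::lora_shrink", mutates_args=())
def lora_shrink(x: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    # torch.compile treats a custom op as opaque: Dynamo records a
    # single call to it instead of tracing through the kernel internals.
    return x @ lora_a

@lora_shrink.register_fake
def _(x: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    # Fake (meta) implementation: shape/dtype propagation only, so the
    # compiler can build a graph around the op without running it.
    return x.new_empty(x.shape[0], lora_a.shape[1])
```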
The difference is that the attention ops have a fairly uniform signature, so we only need to register them once, while the punica ops have several signatures and are applied to various layers, including linear, embedding, etc. Even if we wrapped all of the ops into PyTorch custom ops for `torch.compile`, there would not be much left for `torch.compile` to accelerate.
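To make the signature problem concrete, here is a second hypothetical op: an embedding path would consume integer token indices rather than hidden states, so it could not share a registration with the linear-path op sketched above:

```python
import torch

@torch.library.custom_op("mylib::lora_embedding", mutates_args=())
def lora_embedding(indices: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    # Gathers LoRA rows by token id; the signature differs from the
    # linear variant, so it needs its own registration and fake impl.
    return lora_a[indices]

@lora_embedding.register_fake
def _(indices: torch.Tensor, lora_a: torch.Tensor) -> torch.Tensor:
    return lora_a.new_empty(*indices.shape, lora_a.shape[1])
```

Every such variant multiplies the registration work, which is why wrapping everything is not very attractive.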
Therefore, I tend to leave LoRA as-is, and just ignore `torch.compile` for it.
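One way to "ignore" it, as a sketch (the module below is hypothetical, not vLLM's actual LoRA layer), is to mark the LoRA path with `torch.compiler.disable` so Dynamo graph-breaks around it and runs it eagerly, while the rest of the model still compiles:

```python
import torch

class LoRALinear(torch.nn.Module):
    """Hypothetical LoRA-augmented linear layer, for illustration."""

    def __init__(self, base: torch.nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.lora_a = torch.nn.Parameter(torch.zeros(base.in_features, rank))
        self.lora_b = torch.nn.Parameter(torch.zeros(rank, base.out_features))

    @torch.compiler.disable
    def apply_lora(self, x: torch.Tensor) -> torch.Tensor:
        # Dynamo skips this frame: the LoRA math always runs eagerly,
        # even when the surrounding model is compiled.
        return (x @ self.lora_a) @ self.lora_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.apply_lora(x)
```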
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.