Add multi-LoRA support #1804

Merged
merged 45 commits into from Jan 23, 2024

Conversation

@Yard1 (Collaborator) commented Nov 28, 2023

This PR adds support for running multiple LoRA adapters in a single batch in a similar fashion to the S-LoRA/punica projects.

WIP:

  • I want to clean up the code a little more/add some more documentation before merge.
  • I want to add some examples.
  • I need to test more models.

Features:

  • Uses PEFT format for LoRAs
  • Support for every linear layer + embeddings
  • Support for LoRA-added special tokens
  • Combine both LoRA and non-LoRA (base model only) requests in a single batch
  • Tiered Disk (unlimited)->CPU (LRU)->GPU (fixed number of slots) cache
  • Tensor parallelism support
  • Comprehensive testing
  • Support for LoRAs with different ranks (will be made even more performant in the future)
  • Performance oriented implementation that is friendly to torch.compile/CUDA graphs
  • Efficient BGMV kernels from punica (modified and vendored here)
  • A very simple, unfair scheduler policy has been added to allow for max_num_seqs >= max_loras.

Limitations and possible improvements:

  • Currently, only Llama and Mistral models are supported. There is no reason not to support other models, but only those two have been tested.
  • Running LoRAs with a quantized model is untested (may or may not work - need to check).
  • All LoRAs must have the same data type (will be coerced if needed).
  • We are using BGMV kernels instead of new SGMV kernels from punica. The BGMV kernel is not efficient for prefill, but the current SGMV CUTLASS-based kernel is not configurable enough and suffers from accuracy drops due to the intermediate output being stored in half-precision. Once punica updates with custom, non-CUTLASS SGMV kernels, I will update the code to make use of them.
  • punica kernels require compute capability >= 8.0
  • punica kernels cause compilation to take much longer. Should be possible to optimize.
  • Maximum supported LoRA rank is 64. This will change with new kernels.
  • Tensor parallelism is not used for sharding LoRA computation (as in S-LoRA paper). This should be trivial to add. Will wait on kernels to be updated first.
  • No changes have been made to the scheduler, meaning that we need to have GPU space for as many LoRAs as there are possible sequences in a batch. This should not be an issue in practice for small batch sizes, but may become problematic for larger ones. I will look into fixing this.
  • Unlike S-LoRA, we do not opt to combine LoRA memory and paged KV cache. Instead, we preallocate fixed LoRA slots. This allows for the design to be simpler, but the S-LoRA design could be applied in the future if needed.
  • The system operates under the assumption that all LoRA files are present on disk (there is no auto-download from S3/Hugging Face hub). This could be implemented outside of vLLM, or added in a follow up.
  • The loading/unloading of LoRAs is not overlapped with the forward pass, nor do we take waiting requests for not-yet-loaded LoRAs into account. In practice, we have found the impact of this to be negligible for a reasonable number of LoRAs. However, it would be trivial to add support for that. Left as a follow up.
  • No changes have been made to the OpenAI server/entrypoint. Left as a follow up.
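
To make the user-facing API concrete, here is a minimal offline-inference sketch in the spirit of the planned examples. Parameter names such as `max_loras`/`max_lora_rank`, the model choice, and the adapter path are illustrative assumptions rather than exact excerpts from this diff:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical setup: one base model plus PEFT-format LoRA adapters stored on disk.
llm = LLM(
    "meta-llama/Llama-2-7b-hf",
    enable_lora=True,    # reserve GPU LoRA slots
    max_loras=2,         # adapters resident on GPU at once (assumed knob)
    max_lora_rank=16,    # must cover the largest adapter rank (<= 64 for now)
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# LoRA and base-model requests can be mixed; the LoRA request names a
# PEFT-format adapter directory that already exists on disk.
lora_out = llm.generate(
    ["Write a SQL query listing all users."],
    params,
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_adapter"),
)
base_out = llm.generate(["Tell me a joke."], params)  # no LoRA applied
```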

Yard1 and others added 2 commits November 27, 2023 16:45

Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
@tjtanaa (Contributor) commented Nov 29, 2023

> We are using BGMV kernels instead of new SGMV kernels from punica. The BGMV kernel is not efficient for prefill, but the current SGMV CUTLASS-based kernel is not configurable enough and suffers from accuracy drops due to the intermediate output being stored in half-precision. Once punica updates with custom, non-CUTLASS SGMV kernels, I will update the code to make use of them.

The non-CUTLASS SGMV kernels would be very beneficial for any future ROCm support of the kernel. I am looking forward to it.

@sidnb13 commented Nov 29, 2023

I'm getting an error RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 0 when initializing an engine for offline batched inference:

from vllm import LLM

model = LLM(
    "mistralai/Mistral-7B-v0.1",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    enable_lora=True,
)

Am I missing any steps here?

@Yard1 (Collaborator, Author) commented Nov 29, 2023

@sidnb13 Thanks for the report! The commit I just pushed (a3f191a) should fix this.

@sidnb13 commented Nov 29, 2023

@Yard1 Thanks for fixing! I'm running into a CUDA OOM error this time with the same code:
(screenshot of the CUDA OOM traceback)
I'm using a single A100-40GB. I did specify bfloat16 when initializing the engine, so I'm curious why this is the case.

@Yard1 (Collaborator, Author) commented Nov 29, 2023

@sidnb13 can you try reducing max_num_seqs when initializing the engine? How many LoRAs are you using here? Does this happen when enable_lora is set to False?

@sidnb13 commented Nov 29, 2023

@Yard1 Thanks, I was able to get inference working by reducing the default max_num_seqs from 256 to a much smaller number like 32. With enable_lora=False, I can use max_num_seqs=256.
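
For anyone hitting the same OOM, the workaround amounts to something like the following sketch (the exact max_num_seqs value that fits will depend on the GPU and model):

```python
from vllm import LLM

# With the current design, every possible sequence in a batch needs a
# preallocated LoRA slot, so lowering max_num_seqs reduces the extra
# LoRA memory enough to fit on a single A100-40GB.
model = LLM(
    "mistralai/Mistral-7B-v0.1",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
    enable_lora=True,
    max_num_seqs=32,  # default is 256; lower it until initialization succeeds
)
```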

@Yard1 (Collaborator, Author) commented Nov 29, 2023

The increased memory usage is expected due to the current design requiring a preallocated LoRA tensor for every possible sequence. I will be looking into removing that requirement soon (so you can have, e.g., 32 LoRAs but 256 max sequences).

@sidnb13 commented Nov 29, 2023

I'm also running into errors installing from source with the latest commit. This happens with both `python setup.py install` and `pip install -e .`. It seems to originate from compiling the punica CUDA extensions.

@Yard1 (Collaborator, Author) commented Nov 29, 2023

@sidnb13 should be good now

@Yard1 (Collaborator, Author) commented Nov 29, 2023

Added a very simple scheduler modification to allow the number of LoRA slots to be smaller than the maximum number of sequences in a batch. Note that the resulting policy is not fair and can lead to starvation of certain LoRAs; it should be improved in the future.
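
With that change, the number of GPU LoRA slots can be set independently of the batch size; a sketch, assuming `max_loras` is the knob exposed for the slot count:

```python
from vllm import LLM

# After the scheduler change, the number of GPU LoRA slots (max_loras) can be
# smaller than the maximum number of sequences per batch (max_num_seqs); the
# scheduler limits how many distinct adapters are active in any given step.
llm = LLM(
    "mistralai/Mistral-7B-v0.1",
    enable_lora=True,
    max_loras=4,       # only a few adapters resident on GPU at a time
    max_num_seqs=256,  # batch size no longer tied to the number of LoRA slots
)
```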

@junior-zsy commented

@WoosukKwon @zhuohan123 @Yard1 Looking forward to this being merged into main. I would like to use this feature now. Thank you!

@robertgshaw2-neuralmagic (Collaborator) commented

Nice job!

@zzizzb commented Jan 29, 2024

@cgq0816 Hey man, have you gotten this service deployed? How did you handle it?

@cgq0816 commented Jan 29, 2024

> @cgq0816 Hey man, have you gotten this service deployed? How did you handle it?

It's not deployed yet; I'm getting the problem below:
(screenshot of the error)

@Peilun-Li (Contributor) commented

Nice work! Question on this note:

> The system operates under the assumption that all LoRA files are present on disk (there is no auto-download from S3/Hugging Face hub). This could be implemented outside of vLLM, or added in a follow up.

Does that mean LoRA hot swaps / dynamic runtime LoRA adapter load/unload (as seen in lorax) are not supported yet? If so, is there any plan to support them? It could be a pretty useful feature for improving a model seamlessly and continuously in an always-on production setting. Thanks!

@Yard1 (Collaborator, Author) commented Jan 31, 2024

@Peilun-Li it can hotswap provided all the files are present on disk (we just don't implement the download part, everything else is there)
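
Concretely, hot swapping here just means pointing a later request at an adapter directory that has since appeared on disk; a hedged sketch with hypothetical paths, assuming the `LoRARequest` interface from this PR:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM("mistralai/Mistral-7B-v0.1", enable_lora=True, max_loras=2)
params = SamplingParams(max_tokens=32)

# Adapter v1 already sits on disk and is loaded on first use.
out_v1 = llm.generate(
    ["Classify this ticket: printer is on fire."],
    params,
    lora_request=LoRARequest("adapter_v1", 1, "/adapters/v1"),
)

# Later, once a fine-tuning job has written /adapters/v2, a request with a new
# name/ID picks up the new adapter without restarting the engine; unused
# adapters fall out of the GPU/CPU caches.
out_v2 = llm.generate(
    ["Classify this ticket: printer is on fire."],
    params,
    lora_request=LoRARequest("adapter_v2", 2, "/adapters/v2"),
)
```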

@Peilun-Li (Contributor) commented

> @Peilun-Li it can hotswap provided all the files are present on disk (we just don't implement the download part, everything else is there)

Cool, yeah. I'm envisioning a cyclic model-improvement lifecycle where, every so often (day/week/etc.), we collect production outputs of LoRA adapter v_x, combine them with human feedback to re-fine-tune a v_{x+1}, and hot swap it in to replace v_x. Essentially there is a time dimension: a given adapter may not exist at server deployment time but gets incorporated at a later runtime. Looks like that's mostly possible with some peripheral wiring. Thanks for the context!

@keeganmccallum commented Jan 31, 2024

We're a platform for this type of continuous improvement lifecycle (optionally personalized per user) at xler.ai! We'd love to get you access and hear your feedback!

NikolaBorisov pushed a commit to deepinfra/vllm that referenced this pull request Jan 31, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@pcmoritz mentioned this pull request Feb 15, 2024
@arshadshk commented

Where can I find docs for using the hot swaps?

@nootums commented Feb 21, 2024

https://docs.vllm.ai/en/latest/models/lora.html
https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py

@debraj135 commented

Should the documentation be updated to specify which architectures are supported for multi-LoRA?

@debate1 commented Feb 25, 2024

@cgq0816 Brother, have you resolved that error? I'm running into the same problem.

@x-transformers commented

#3316 is this normal?
