
multi-LoRA as extra models in OpenAI server #2775

Merged: 31 commits merged into vllm-project:main from the openai-lora branch on Feb 17, 2024

Conversation

@jvmncs (Contributor) commented Feb 5, 2024

closes #2600

How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):

```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```

The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one for each of the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in `EngineArgs`.

No work has been done here to scope client permissions to specific models.
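
For illustration, here is a minimal sketch of verifying the served model list with the OpenAI Python client, assuming the server above is running on the default http://localhost:8000 (the client setup mirrors the usage example later in this thread):

```python
from openai import OpenAI

# The vLLM OpenAI-compatible server accepts a placeholder API key by default.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# /models should return three entries: the base model plus the two
# LoRA modules registered via --lora-modules (sql-lora and sql-lora2).
for model in client.models.list():
    print(model.id)
```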

@jvmncs (Contributor, Author) commented Feb 5, 2024

I was planning to add a test in tests/entrypoints/test_openai_server.py, but it's unclear which LoRAs are usable with the existing zephyr model from that file. Happy to swap in different model/LoRA defaults from Hugging Face, if that's desirable.

@simon-mo (Collaborator) commented Feb 5, 2024

@jvmncs, thanks for the PR. I think you can change the base model to another one that works. Any suggestions? If not, we can also create a new test file just to test LoRA support.

@simon-mo simon-mo self-assigned this Feb 5, 2024
@Yard1 Yard1 self-assigned this Feb 5, 2024
@jvmncs (Contributor, Author) commented Feb 5, 2024

This one looks good: https://huggingface.co/typeof/zephyr-7b-beta-lora
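
For context, a rough sketch of what such a check in tests/entrypoints/test_openai_server.py could look like, reusing the `server` and `client` fixtures visible in the existing `test_single_completion`; the base model name, the `zephyr-lora` aliases, and the pytest-asyncio usage are illustrative assumptions, not the final test:

```python
import openai
import pytest

# Assumed names for illustration: the zephyr base model from the existing test
# file, with the typeof/zephyr-7b-beta-lora adapter registered under two aliases.
MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"
LORA_NAMES = ["zephyr-lora", "zephyr-lora2"]


@pytest.mark.asyncio
async def test_models_list_includes_loras(server, client: openai.AsyncOpenAI):
    models = await client.models.list()
    served_ids = [model.id for model in models.data]
    # The base model and every registered LoRA module should be listed by /models.
    assert MODEL_NAME in served_ids
    for lora_name in LORA_NAMES:
        assert lora_name in served_ids
```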

@jvmncs (Contributor, Author) commented on the diff:

```python
)
async def test_single_completion(server, client: openai.AsyncOpenAI,
                                 model_name: str):
    completion = await client.completions.create(model=model_name,
```

For some reason this test was failing for all cases when I switched to MODEL_NAME="mistralai/Mistral-7B-v0.1": the model consistently emitted "1999" no matter the prompt/temperature I tried. Not really sure why that's the case, but reverting to the zephyr model fixed it.

@jvmncs (Contributor, Author) commented Feb 6, 2024

The latest commit should be ready for review, assuming CI passes as it did on my machine.

@jvmncs (Contributor, Author) commented Feb 12, 2024

@simon-mo @Yard1 bumping this for review

@Yard1 (Collaborator) left a comment:

This looks pretty good to me. Could we add an example of starting an API server with LoRAs?

@jvmncs (Contributor, Author) commented Feb 12, 2024

> could we add an example of starting an API server with loras

Sure @Yard1, what kind of example were you thinking? Other than the command snippet in my original comment, I'm not sure what that would look like.

@Yard1 (Collaborator) commented Feb 13, 2024

Hmm, I guess we don't really have an example where we start the server. In that case, just extending the docs (https://github.com/vllm-project/vllm/blob/main/docs/source/models/lora.rst) with a snippet of how to run the OpenAI server with LoRA should be enough!

hongxiayang and others added 15 commits February 14, 2024 11:21

Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>

* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell

@jvmncs (Contributor, Author) commented Feb 15, 2024

@Yard1 should be good to go

@Yard1 (Collaborator) left a comment:

Thanks, looks good!

@Yard1 merged commit 8f36444 into vllm-project:main Feb 17, 2024
19 checks passed
@jvmncs deleted the openai-lora branch February 17, 2024 23:03
xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024

@Wizmak9 commented Feb 29, 2024

@Yard1 can we hot-swap LoRA adapters, i.e. change which LoRA adapters are attached to the base model on the fly, without restarting the base model?

@ajaychinni commented:

@Wizmak9 You need to start vLLM using the OpenAI server, following the guidelines provided in the documentation:

```terminal
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/ sql-lora2=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test2/
```

In the above command, two LoRA modules, sql-lora and sql-lora2, are loaded as an example.

Now, when making a request, you can specify which LoRA you wish to use. In subsequent requests, you can select different LoRA adapters on the fly, without needing to restart the model.

Here is Python code to simulate a request using sql-lora. You can comment out this model name and use the base model or another LoRA adapter in subsequent requests:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    # model = "meta-llama/Llama-2-7b-hf",
    model="sql-lora",
    # model = "sql-lora2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport "},
    ],
)
print("Chat response:", chat_response)
```

xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024

Successfully merging this pull request may close these issues: Integrate multi-LoRA functionality with OpenAI server (#2600)