Allow model to be served under multiple names #2894

Merged
merged 11 commits into vllm-project:main from served-model-name-aliases
Apr 18, 2024

Conversation

hmellor
Collaborator

@hmellor hmellor commented Feb 16, 2024

This means that you can have more specific model names without requiring users to update their configs whenever you change something that warrants a model name change.

If you passed `--served-model-name gpt-4-0613 gpt-4`, then your users could make requests to either `gpt-4` or `gpt-4-0613`. The `model` field of any responses will contain the first model name, `gpt-4-0613` in this case, so that a user using `gpt-4` knows which version of the model answered their request.

OpenAI calls this [Continuous model upgrades](https://platform.openai.com/docs/models/continuous-model-upgrades).
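For illustration, a minimal client-side sketch (not part of the PR itself; it assumes a vLLM OpenAI-compatible server launched with `--served-model-name gpt-4-0613 gpt-4` and listening on `http://localhost:8000/v1`, and the API key is a placeholder):

```python
from openai import OpenAI

# Placeholder endpoint and key for a local vLLM OpenAI-compatible server
# started with: --served-model-name gpt-4-0613 gpt-4
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for alias in ("gpt-4", "gpt-4-0613"):
    resp = client.chat.completions.create(
        model=alias,
        messages=[{"role": "user", "content": "Hello"}],
    )
    # Both aliases resolve to the same underlying model; per the description
    # above, the response's `model` field reports the first served name,
    # so this is expected to print "gpt-4-0613" for either request.
    print(alias, "->", resp.model)
```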

hmellor and others added 4 commits February 16, 2024 18:17
@esmeetu
Collaborator

esmeetu commented Feb 22, 2024

Hi @hmellor, thanks for proposing this idea, but I have a different opinion.
Regarding OpenAI, there should be a router or gateway at the top layer of the model service that points to different underlying model services and can switch between them freely. Currently, I believe vLLM is an atomic service at the bottom layer, where model names and model files correspond one-to-one. Renaming might lead to confusion. For instance, gpt-4 and gpt-4-0125 are definitely different model services. If I wanted to update the model service today, I would deploy a new microservice, such as gpt-4-0222, and then switch gpt-4 to this new underlying microservice at the router, instead of redeploying the old one. Therefore, I think it would be preferable to treat vLLM as an atomic microservice.
In the end, I would like to keep the current OpenAI HTTP server small and simple, supporting only atomic features.
@WoosukKwon @zhuohan123 @simon-mo

@simon-mo
Collaborator

I agree this is out of scope for the vLLM model server; it is the responsibility of a router. Are there concrete models you have in mind that need this, or is it a custom model?

@hmellor
Collaborator Author

hmellor commented Feb 22, 2024

My use case was that I wanted to be able to change models without my users needing to change their configs. For example, if I were to use `--served-model-name codellama default`, my users could set their configs to `default` and would not need to change anything if I ever switched to a different model, `wizardcoder` for example.

You're right that this could (and likely should) be done by the router. It just seemed like a small change and a nice QoL feature, but if it's out of scope I won't fight to get it in.

@simon-mo
Collaborator

I understand the use case. I'm fine with multiple names. But:

  • The model in the returned response should be the model requested by the user, not the first served model name.
  • The list model calls with ModelCards should enumerate all names, not just the first one.

@hmellor
Collaborator Author

hmellor commented Feb 22, 2024

  • The model in the returned response should be the model requested by the user, not the first served model name.

The first served model name is returned because it is meant to be the most specific. If a user queries default, it is useful for them to know which model responded, even though they weren't requesting a specific model. This is what the OpenAI API does.

  • The list model calls with ModelCards should enumerate all names, not just the first one.

For non-LoRA models, it does enumerate all names.
For LoRA models, we only added the first served model name as the root of all LoRA models.

Would you prefer it if for the LoRA models we return all combinations of self.served_model_names and self.lora_requests?

@simon-mo
Collaborator

If a user queries default, it is useful for them to know which model responded, even though they weren't requesting a specific model. This is what the OpenAI API does.

Ah, this makes sense.

For LoRA models, we only added the first served model name as the root of all LoRA models.
Would you prefer it if for the LoRA models we return all combinations of self.served_model_names and self.lora_requests?

@Yard1 wdyt?

@hmellor
Collaborator Author

hmellor commented Mar 6, 2024

I think it makes sense to leave it as is.

If `request.model` contains a LoRA model name, then the engine will fetch the corresponding LoRA request using:

        # Return the LoRA request whose adapter name matches the requested model.
        for lora in self.lora_requests:
            if request.model == lora.lora_name:
                return lora

It makes sense for the root of all of those LoRA models to be the most detailed version of the possible aliases in `self.served_model_names` (i.e. the first one).

@hmellor
Collaborator Author

hmellor commented Mar 19, 2024

I've changed the `model_cards` list to have `self.served_model_names[0]` as their root, to match the way the `ModelCard`s for `lora_cards` work.

So a user calling `/v1/models` knows that the model called `default` comes from `codellama` (to reuse my example from an earlier comment).
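To make that concrete, a rough sketch of the resulting listing logic (illustrative only; `ModelCard` here is a simplified stand-in rather than the exact vLLM class, and the helper name is hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ModelCard:
    # Simplified stand-in for the server's model card: the name a client can
    # request, and the "root" model it ultimately resolves to.
    id: str
    root: str


def build_model_cards(served_model_names: list[str],
                      lora_names: list[str]) -> list[ModelCard]:
    root = served_model_names[0]  # the most specific alias, e.g. "codellama"
    cards = [ModelCard(id=name, root=root) for name in served_model_names]
    # LoRA adapters are listed too, also rooted at the first served name.
    cards += [ModelCard(id=name, root=root) for name in lora_names]
    return cards


# /v1/models would then list both "codellama" and "default", rooted at "codellama".
print(build_model_cards(["codellama", "default"], lora_names=[]))
```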

@hmellor hmellor requested a review from simon-mo March 19, 2024 12:42
@AaronFriel

@simon-mo Strong support for merging this to make model upgrades easier in IaC scenarios.

@simon-mo
Collaborator

simon-mo commented Apr 6, 2024

Ok let's get this PR in.

Collaborator

@njhill njhill left a comment


Thanks @hmellor, I'm also in favour of this change.

Do these changes allow configuring such that the model name isn't required at all in the API (as requested in #1478)? Since the server currently will only ever serve one "base" model, there's technically no need to include this API field, unless you're using a lora adapter.

vllm/entrypoints/openai/cli_args.py (review comment, outdated and resolved)
@hmellor
Collaborator Author

hmellor commented Apr 8, 2024

Do these changes allow configuring such that the model name isn't required at all in the API (as requested in #1478)? Since the server currently will only ever serve one "base" model, there's technically no need to include this API field, unless you're using a lora adapter.

No, it doesn't. A PR that removes the need for model name in requests was already rejected in #1541.

@AaronFriel

If `--served-model-name` is `""`, is it functionally equivalent?

@hmellor
Collaborator Author

hmellor commented Apr 8, 2024

Potentially, actually. @samos123, would this solve your issue?

@simon-mo simon-mo merged commit 66ded03 into vllm-project:main Apr 18, 2024
35 checks passed
@simon-mo
Collaborator

Sorry about the delay, merged.

@hmellor hmellor deleted the served-model-name-aliases branch April 18, 2024 22:37
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 21, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
@samos123

This doesn't really solve my issue; I would simply want an option to remove the check entirely. I don't want vLLM to handle this and would prefer to let it accept any request it receives, since in my use case vLLM only ever serves one model.

alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
@yhyu13

yhyu13 commented Jun 1, 2024

Is this feature available in any vLLM version?

I failed on vLLM 0.4.1 with

--served-model-name  "gpt-3.5-turbo gpt-4 gpt-4o"
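One possible cause is the quoting rather than the vLLM version: the usage in the PR description passes the names as separate arguments. A minimal sketch of the difference, assuming the flag is declared with argparse's `nargs="+"` (an assumption based on the multi-name usage above, not a confirmed detail of vLLM's CLI):

```python
import argparse

# Hypothetical reconstruction of the flag's parsing, assuming nargs="+".
parser = argparse.ArgumentParser()
parser.add_argument("--served-model-name", nargs="+", type=str)

# Unquoted, space-separated names arrive as three separate aliases.
args = parser.parse_args(["--served-model-name", "gpt-3.5-turbo", "gpt-4", "gpt-4o"])
print(args.served_model_name)  # ['gpt-3.5-turbo', 'gpt-4', 'gpt-4o']

# A single quoted string arrives as one alias that literally contains spaces,
# which no request's model field will ever match.
args = parser.parse_args(["--served-model-name", "gpt-3.5-turbo gpt-4 gpt-4o"])
print(args.served_model_name)  # ['gpt-3.5-turbo gpt-4 gpt-4o']
```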

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>