Allow model to be served under multiple names #2894

Merged
merged 11 commits into vllm-project:main from served-model-name-aliases
Apr 18, 2024

Conversation

hmellor
Collaborator

@hmellor hmellor commented Feb 16, 2024

This means that you can have more specific model names without requiring users to update their configs whenever you change something that warrants a model name change.

If you passed `--served-model-name gpt-4-0613 gpt-4`, then your users could make requests to either `gpt-4` or `gpt-4-0613`. The `model` field of any responses will contain the first model name, `gpt-4-0613` in this case, so that a user using `gpt-4` knows which version of the model answered their request.

OpenAI calls this [Continuous model upgrades](https://platform.openai.com/docs/models/continuous-model-upgrades).
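For illustration, a minimal client-side sketch (not part of the PR itself; it assumes a vLLM OpenAI-compatible server launched with `--served-model-name gpt-4-0613 gpt-4` and listening on `http://localhost:8000/v1`, and the API key is a placeholder):

```python
from openai import OpenAI

# Placeholder endpoint and key for a local vLLM OpenAI-compatible server
# started with: --served-model-name gpt-4-0613 gpt-4
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for alias in ("gpt-4", "gpt-4-0613"):
    resp = client.chat.completions.create(
        model=alias,
        messages=[{"role": "user", "content": "Hello"}],
    )
    # Both aliases resolve to the same underlying model; per the description
    # above, the response's `model` field reports the first served name,
    # so this is expected to print "gpt-4-0613" for either request.
    print(alias, "->", resp.model)
```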

hmellor and others added 4 commits February 16, 2024 18:17
@esmeetu
Collaborator

esmeetu commented Feb 22, 2024

Hi @hmellor, thanks for proposing this idea, but I have a different opinion.
Regarding OpenAI, there should be a router or gateway at the top layer of the model service that points to different underlying model services and can switch between them freely. Currently, I believe vLLM is an atomic service at the bottom layer, where model names and model files correspond one-to-one. Renaming might lead to confusion. For instance, gpt-4 and gpt-4-0125 are definitely different model services. If I wanted to update the model service today, I would deploy a new microservice, such as gpt-4-0222, and then switch gpt-4 to this new underlying microservice at the router, instead of redeploying the old one. Therefore, I think it would be preferable to treat vLLM as an atomic microservice.
In the end, I would like to keep the current OpenAI HTTP server small and simple, supporting only atomic features.
@WoosukKwon @zhuohan123 @simon-mo

@simon-mo
Collaborator

I agree this is out of scope for the vLLM model server; it is the responsibility of a router. Are there concrete models you have in mind that need this, or is it a custom model?

@hmellor
Collaborator Author

hmellor commented Feb 22, 2024

My use case was that I wanted to be able to change models without my users needing to change their configs. For example, if I were to use `--served-model-name codellama default`, my users could set their configs to `default` and would not need to change anything if I ever switched to a different model, `wizardcoder` for example.

You're right that this could (and likely should) be done by the router. It just seemed like a small change and a nice QoL feature, but if it's out of scope I won't fight to get it in.

@simon-mo
Collaborator

I understand the use case. I'm fine with multiple names. But:

  • The model in the returned response should be the model requested by the user, not the first served model name.
  • The list model calls with ModelCards should enumerate all names, not just the first one.

@hmellor
Collaborator Author

hmellor commented Feb 22, 2024

  • The model in the returned response should be the model requested by the user, not the first served model name.

The first served model name is returned because it is meant to be the most specific. If a user queries default, it is useful for them to know which model responded, even though they weren't requesting a specific model. This is what the OpenAI API does.

  • The list model calls with ModelCards should enumerate all names, not just the first one.

For non-LoRA models, it does enumerate all names.
For LoRA models, we only added the first served model name as the root of all LoRA models.

Would you prefer it if for the LoRA models we return all combinations of self.served_model_names and self.lora_requests?

@simon-mo
Collaborator

If a user queries default, it is useful for them to know which model responded, even though they weren't requesting a specific model. This is what the OpenAI API does.

Ah, this makes sense.

For LoRA models, we only added the first served model name as the root of all LoRA models.
Would you prefer it if for the LoRA models we return all combinations of self.served_model_names and self.lora_requests?

@Yard1 wdyt?

@hmellor
Collaborator Author

hmellor commented Mar 6, 2024

I think it makes sense to leave it as is.

If `request.model` contains a LoRA model name, then the engine will fetch the corresponding LoRA request using:

        # Return the LoRA request whose adapter name matches the requested model.
        for lora in self.lora_requests:
            if request.model == lora.lora_name:
                return lora

It makes sense for the root of all of those LoRA models to be the most detailed version of the possible aliases in `self.served_model_names` (i.e. the first one).

@hmellor
Collaborator Author

hmellor commented Mar 19, 2024

I've changed the `model_cards` list to have `self.served_model_names[0]` as their root, to match the way the `ModelCard`s for `lora_cards` work.

So a user calling `/v1/models` knows that the model called `default` comes from `codellama` (to reuse my example from an earlier comment).
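To make that concrete, a rough sketch of the resulting listing logic (illustrative only; `ModelCard` here is a simplified stand-in rather than the exact vLLM class, and the helper name is hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ModelCard:
    # Simplified stand-in for the server's model card: the name a client can
    # request, and the "root" model it ultimately resolves to.
    id: str
    root: str


def build_model_cards(served_model_names: list[str],
                      lora_names: list[str]) -> list[ModelCard]:
    root = served_model_names[0]  # the most specific alias, e.g. "codellama"
    cards = [ModelCard(id=name, root=root) for name in served_model_names]
    # LoRA adapters are listed too, also rooted at the first served name.
    cards += [ModelCard(id=name, root=root) for name in lora_names]
    return cards


# /v1/models would then list both "codellama" and "default", rooted at "codellama".
print(build_model_cards(["codellama", "default"], lora_names=[]))
```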

@hmellor hmellor requested a review from simon-mo March 19, 2024 12:42
@AaronFriel

@simon-mo Strong support for merging this to make model upgrades easier in IaC scenarios.

@simon-mo
Collaborator

simon-mo commented Apr 6, 2024

Ok let's get this PR in.

Collaborator

@njhill njhill left a comment


Thanks @hmellor, I'm also in favour of this change.

Do these changes allow configuring such that the model name isn't required at all in the API (as requested in #1478)? Since the server currently will only ever serve one "base" model, there's technically no need to include this API field, unless you're using a lora adapter.

vllm/entrypoints/openai/cli_args.py (review comment, outdated and resolved)
@hmellor
Collaborator Author

hmellor commented Apr 8, 2024

Do these changes allow configuring such that the model name isn't required at all in the API (as requested in #1478)? Since the server currently will only ever serve one "base" model, there's technically no need to include this API field, unless you're using a lora adapter.

No, it doesn't. A PR that removes the need for model name in requests was already rejected in #1541.

@AaronFriel

If `--served-model-name` is `""`, is it functionally equivalent?

@hmellor
Collaborator Author

hmellor commented Apr 8, 2024

Potentially, actually. @samos123, would this solve your issue?

@simon-mo simon-mo merged commit 66ded03 into vllm-project:main Apr 18, 2024
35 checks passed
@simon-mo
Collaborator

Sorry about the delay, merged.

@hmellor hmellor deleted the served-model-name-aliases branch April 18, 2024 22:37
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 21, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
@samos123

This doesn't really solve my issue; I would simply want an option to remove the check entirely. I don't want vLLM to handle this and would prefer to let it accept any request it receives, since in my use case vLLM only ever serves one model.

alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
@yhyu13

yhyu13 commented Jun 1, 2024

Is this feature available in any vLLM version?

I failed on vLLM 0.4.1 with

--served-model-name  "gpt-3.5-turbo gpt-4 gpt-4o"
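One possible cause is the quoting rather than the vLLM version: the usage in the PR description passes the names as separate arguments. A minimal sketch of the difference, assuming the flag is declared with argparse's `nargs="+"` (an assumption based on the multi-name usage above, not a confirmed detail of vLLM's CLI):

```python
import argparse

# Hypothetical reconstruction of the flag's parsing, assuming nargs="+".
parser = argparse.ArgumentParser()
parser.add_argument("--served-model-name", nargs="+", type=str)

# Unquoted, space-separated names arrive as three separate aliases.
args = parser.parse_args(["--served-model-name", "gpt-3.5-turbo", "gpt-4", "gpt-4o"])
print(args.served_model_name)  # ['gpt-3.5-turbo', 'gpt-4', 'gpt-4o']

# A single quoted string arrives as one alias that literally contains spaces,
# which no request's model field will ever match.
args = parser.parse_args(["--served-model-name", "gpt-3.5-turbo gpt-4 gpt-4o"])
print(args.served_model_name)  # ['gpt-3.5-turbo gpt-4 gpt-4o']
```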

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>