
Add guided decoding for OpenAI API server #2819

Merged
35 commits merged into vllm-project:main on Feb 29, 2024

Conversation

felixzhu555
Contributor

@felixzhu555 felixzhu555 commented Feb 8, 2024

Support guided decoding (JSON, regex, choice) using outlines for the completion and chat completion OpenAI endpoints.
This is a continuation of @br3no's work in #2815.

relevant: #288

@felixzhu555 felixzhu555 changed the title from "Add structured gen" to "Add structured generation for OpenAI server" Feb 8, 2024
@felixzhu555 felixzhu555 changed the title from "Add structured generation for OpenAI server" to "Add structured generation for OpenAI API server" Feb 8, 2024
@simon-mo simon-mo self-assigned this Feb 8, 2024
@felixzhu555 felixzhu555 changed the title from "Add structured generation for OpenAI API server" to "Add guided decoding for OpenAI API server" Feb 8, 2024
@jalotra

jalotra commented Feb 9, 2024

@felixzhu555 can you please add an example of how the guided_json or guided_regex should look?
For example, if the output of vLLM has to be:

{
    "name" : "jalotra",
    "country" : "india"
}

@felixzhu555
Contributor Author

felixzhu555 commented Feb 9, 2024

Hi @jalotra, this feature is still in development, but once it's added I'd imagine you would define the JSON schema or regex as a Python dictionary, string, or pydantic BaseModel class. You can see a simple example that uses pydantic here: https://github.com/outlines-dev/outlines/blob/main/examples/vllm_integration.py

For your example, you would pass a JSON schema describing that object to the vLLM OpenAI server through the extra_body parameter:
extra_body=dict(guided_json=schema)
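
For instance, a schema matching that object could look like this (a sketch, assuming guided_json accepts a JSON schema dict, as in the outlines integration):

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["name", "country"],
}
# then, per request via the OpenAI Python client:
# client.chat.completions.create(..., extra_body={"guided_json": schema})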

@br3no
Contributor

br3no commented Feb 9, 2024

@felixzhu555 great to see this being picked up so quickly!

Don't you think guided decoding in the OpenAI API server should mimic the way OpenAI itself offers it?

Now that outlines is integrated, it becomes easier to support e.g. the tools parameter in the chat completion API.

Take this example request from the OpenAI API reference docs (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools):

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in Boston?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'

We could extract the JSON schema from parameters, inject a prompt telling the model to generate JSON in this format, and at the same time use outlines to enforce that the response matches the specification.
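
As a rough illustration of that idea (a minimal sketch; the helper name is hypothetical, and the request shape follows the OpenAI example above):

import json

def schema_from_tools(request_body: dict) -> str | None:
    # Pull the JSON schema out of the first function tool, if any;
    # outlines would then compile and enforce it during decoding.
    for tool in request_body.get("tools", []):
        if tool.get("type") == "function":
            return json.dumps(tool["function"]["parameters"])
    return None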

@br3no
Contributor

br3no commented Feb 9, 2024

Supporting tools and tool_choice would solve #1869.

@felixzhu555
Contributor Author

Hey @br3no, thanks for the suggestion, I'll definitely look into supporting the tools parameter later on.

@ibeltagy

ibeltagy commented Feb 10, 2024

I gave this a try and it worked. Thank you!

There's a bug in dummy_llm: it fails because it can't find an LLM at the path dummy. Here's a simple fix that doesn't need an LLM object at all, since outlines only uses the tokenizer:

    def dummy_llm():
        # outlines only ever reads llm.tokenizer.tokenizer, so two nested
        # namespaces wrapping the real tokenizer are enough to stand in
        # for a full LLM object.
        import types
        inner = types.SimpleNamespace()
        outer = types.SimpleNamespace()
        inner.tokenizer = tokenizer  # the actual HF tokenizer
        outer.tokenizer = inner
        return outer
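
Presumably you would then pass dummy_llm() wherever outlines expects the LLM (a hypothetical sketch; the outlines import path and constructor signature varied across versions at the time):

from outlines.serve.vllm import JSONLogitsProcessor

# works because outlines only reads llm.tokenizer.tokenizer,
# which dummy_llm() provides
processor = JSONLogitsProcessor(json_schema, dummy_llm())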

@ibeltagy

@br3no, are you suggesting removing the proposed guided_json=schema option and supporting only tools?

It seems to me that it doesn't have to be one or the other. The OpenAI API doesn't allow a custom JSON schema for constrained decoding, so this is a new capability that doesn't fit the existing API. You are right that guided JSON can be used to power tools, but that doesn't necessarily have to replace guided_json=schema.

@br3no
Contributor

br3no commented Feb 11, 2024

@ibeltagy vLLM offers two server implementations, one of which mimics the OpenAI API. The nice thing about offering an OpenAI-compatible server is that vLLM works as a drop-in replacement for OpenAI. All the stuff that works with OpenAI simply works with vLLM. As an example, take this project here: https://github.com/jxnl/instructor. (I’m not affiliated, nor do I recommend or know much about this project; it’s just an example)

If you start diverging from the OpenAI API, this can break.

At the same time, if users want to use features the OpenAI API doesn’t offer, they are free to use the vLLM-own server implementation, where the vLLM maintainers and community are free to go beyond what OpenAI offers.

While the OpenAI API doesn’t offer an explicit json schema option, the tools functionality offers a way to do exactly that.

So yes, I don’t see a good argument for adding these parameters to the OpenAI server, because:

  1. You can use the JSON schema functionality through the tools parameter.
  2. The risk of breaking compatibility gets higher, and the cost of maintaining this server implementation increases.
  3. You are free to use the vLLM-own server if you want access to JSON schema and regex-guided decoding.

@ibeltagy

@br3no, makes sense

@felixzhu555
Contributor Author

@br3no @ibeltagy the tools parameter is being added through #2488; this PR will only add guided decoding features.

Collaborator

@simon-mo simon-mo left a comment


A biiiig problem here is that we are creating a new logits processor per request. A common scenario will be handling repeated schemas/constraints. Can you think of a good (and semantically correct) way to cache the logits processors or the FSM in outlines?
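
For illustration, one possible shape for such a cache (a minimal sketch; build_json_logits_processor is a hypothetical stand-in for whatever compiles a schema into an outlines FSM):

from functools import lru_cache

def build_json_logits_processor(schema_str: str):
    ...  # expensive: compile the JSON schema into a regex/FSM

@lru_cache(maxsize=128)
def cached_logits_processor(schema_str: str):
    # Key the cache on the canonical schema string so repeated requests
    # reuse the compiled FSM; the shared object must not carry
    # per-sequence decoding state for this to be safe.
    return build_json_logits_processor(schema_str)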

Review comments (since resolved) were left on:
requirements.txt
tests/entrypoints/test_openai_server_guided_decoding.py
vllm/entrypoints/api_server.py
vllm/entrypoints/openai/serving_chat.py
vllm/entrypoints/openai/serving_completion.py
vllm/model_executor/guided_decoding.py
@simon-mo
Collaborator

simon-mo commented Feb 14, 2024 via email

@simon-mo simon-mo mentioned this pull request Feb 28, 2024
@simon-mo simon-mo enabled auto-merge (squash) February 29, 2024 06:00
@simon-mo simon-mo merged commit 703e42e into vllm-project:main Feb 29, 2024
22 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
@arshadshk

Where can I find documentation for using guided decoding?

@felixzhu555
Contributor Author

Hi @arshadshk, sorry, we don't have written documentation for guided decoding just yet; I'll try to add that soon. If you have a specific use case I can try to explain how to use it; otherwise you can check out the guided decoding tests for some examples.

@felixzhu555 felixzhu555 deleted the add_structured_gen branch March 12, 2024 19:55
@BedirT

BedirT commented Mar 20, 2024

Is this strictly an OpenAI-compatible server feature? I don't see any mention of it being available as part of plain vLLM generation calls. If so, what is the reason for that?

@simon-mo
Collaborator

Currently yes. But it would be valuable to have a similar API in the LLM class. I will open a new issue to get folks working on it.

@simon-mo
Collaborator

@BedirT issue created #3536. I will have someone working on it.

@ProVega

ProVega commented Apr 10, 2024

Where can I find documentation for this feature?

@BedirT

BedirT commented Apr 10, 2024

I don't think there is lengthy documentation on it, but it is briefly mentioned here: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-completions-api

@ProVega

ProVega commented Apr 10, 2024

That page states the parameters, but doesn't provide any actual samples of a JSON schema. I have found a few others, but I can't find anything that shows how to return an ARRAY of JSON objects. Example: "Find me 5 companies that sell cars" and I get back { "Results": [ { "name": "Ford" }, { "name": "Toyota" }, { "name": "BMW" } ] }

@AaronFriel

AaronFriel commented Apr 10, 2024

@ProVega Tools like pydantic or zod-to-json-schema are helpful here. I think the schema you want, for just an array of names, is:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" }
    }
  }
}

Or with a top-level Results key:

{
  "type": "object",
  "properties": {
    "Results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" }
        }
      }
    }
  }
}
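
As a side note, schemas like these can be generated rather than hand-written, e.g. with pydantic v2 (a sketch; model_json_schema emits an equivalent schema, albeit with $defs/references instead of the inline form above):

from pydantic import BaseModel

class Company(BaseModel):
    name: str

class Results(BaseModel):
    Results: list[Company]

# a JSON schema dict suitable for passing as guided_json
print(Results.model_json_schema())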

@sam1am

sam1am commented Apr 30, 2024

This took me some tinkering to figure out, given all the different methods and tools discussed around this issue. Here is a simple example that consistently produces guided JSON output for me:

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder; vLLM ignores the key unless --api-key is set
)

json_schema = {
    "type": "object",
    "properties": {
        "thought": {"type": "string"},
        "answer": {"type": "string"}
    },
    "required": ["thought", "answer"]
}

query = "What is the capital of France?"

# Ask for JSON in the prompt; guided_json below enforces the schema
system_prompt = "Respond only with a json object containing the following fields and nothing else: thought, answer."
completion = client.chat.completions.create(
    model="casperhansen/llama-3-8b-instruct-awq",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ],
    extra_body={
        "stop_token_ids": [128009], 
        "response_format": {"type": "json_object"},
        "guided_json": json_schema
    }
)

print(completion.choices[0].message.content.strip())

Output:

{
  "answer": "Paris",
  "thought": "Bonjour!"
}

The only issue is the key order, which comes out alphabetical regardless of how I order the fields in the code. That matters for tree of thought-type prompting, where the model should produce its reasoning before the answer.
