
Add guided decoding for OpenAI API server #2819

Merged
35 commits merged into vllm-project:main on Feb 29, 2024

Conversation

felixzhu555
Contributor

@felixzhu555 felixzhu555 commented Feb 8, 2024

Support guided decoding (JSON, regex, choice) using outlines for the completion and chat completion OpenAI endpoints.
This is a continuation of @br3no's work in #2815.

relevant: #288

@felixzhu555 felixzhu555 changed the title from "Add structured gen" to "Add structured generation for OpenAI server" Feb 8, 2024
@felixzhu555 felixzhu555 changed the title from "Add structured generation for OpenAI server" to "Add structured generation for OpenAI API server" Feb 8, 2024
@simon-mo simon-mo self-assigned this Feb 8, 2024
@felixzhu555 felixzhu555 changed the title from "Add structured generation for OpenAI API server" to "Add guided decoding for OpenAI API server" Feb 8, 2024
@jalotra

jalotra commented Feb 9, 2024

@felixzhu555 can you please add an example of how the guided_json or guided_regex should look?
For example, if the output of vLLM has to be:

{
    "name" : "jalotra",
    "country" : "india"
}

@felixzhu555
Contributor Author

felixzhu555 commented Feb 9, 2024

Hi @jalotra, this feature is still in development, but once it's added I'd imagine you would define the JSON schema or regex as a Python dictionary, string, or pydantic BaseModel class. You can see a simple example that uses pydantic here: https://github.com/outlines-dev/outlines/blob/main/examples/vllm_integration.py

For your example, you would pass a JSON schema describing that object to the vLLM OpenAI server through the extra_body parameter:
extra_body=dict(guided_json=schema)
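
For instance, a schema matching that object could look like this (a sketch, assuming guided_json accepts a JSON schema dict, as in the outlines integration):

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["name", "country"],
}
# then, per request via the OpenAI Python client:
# client.chat.completions.create(..., extra_body={"guided_json": schema})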

@br3no
Contributor

br3no commented Feb 9, 2024

@felixzhu555 great to see this being picked up so quickly!

Don't you think guided decoding in the OpenAI API server should mimic the way OpenAI itself offers it?

Now that outlines is integrated, it becomes easier to support e.g. the tools parameter in the chat completion API.

Take this example request from the OpenAI API reference docs (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools):

curl https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in Boston?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}'

We could extract the JSON schema from parameters, inject a prompt telling the model to generate JSON in this format, and at the same time use outlines to enforce that the response matches the specification.
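
As a rough illustration of that idea (a minimal sketch; the helper name is hypothetical, and the request shape follows the OpenAI example above):

import json

def schema_from_tools(request_body: dict) -> str | None:
    # Pull the JSON schema out of the first function tool, if any;
    # outlines would then compile and enforce it during decoding.
    for tool in request_body.get("tools", []):
        if tool.get("type") == "function":
            return json.dumps(tool["function"]["parameters"])
    return None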

@br3no
Contributor

br3no commented Feb 9, 2024

Supporting tools and tool_choice would solve #1869.

@felixzhu555
Contributor Author

Hey @br3no, thanks for the suggestion, I'll definitely look into supporting the tools parameter later on.

@ibeltagy

ibeltagy commented Feb 10, 2024

I gave this a try and it worked. Thank you!

There's a bug in dummy_llm: it fails because it can't find an LLM at the path dummy. Here's a simple fix that doesn't need an LLM object at all, since outlines only uses the tokenizer:

    def dummy_llm():
        # outlines only ever reads llm.tokenizer.tokenizer, so two nested
        # namespaces wrapping the real tokenizer are enough to stand in
        # for a full LLM object.
        import types
        inner = types.SimpleNamespace()
        outer = types.SimpleNamespace()
        inner.tokenizer = tokenizer  # the actual HF tokenizer
        outer.tokenizer = inner
        return outer
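
Presumably you would then pass dummy_llm() wherever outlines expects the LLM (a hypothetical sketch; the outlines import path and constructor signature varied across versions at the time):

from outlines.serve.vllm import JSONLogitsProcessor

# works because outlines only reads llm.tokenizer.tokenizer,
# which dummy_llm() provides
processor = JSONLogitsProcessor(json_schema, dummy_llm())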

@ibeltagy

@br3no, are you suggesting removing the proposed guided_json=schema option and supporting only tools?

It seems to me that it doesn't have to be one or the other. The OpenAI API doesn't allow a custom JSON schema for constrained decoding, so this is a new capability that doesn't fit the existing API. You are right that guided JSON can be used to power tools, but that doesn't necessarily have to replace guided_json=schema.

@br3no
Contributor

br3no commented Feb 11, 2024

@ibeltagy vLLM offers two server implementations, one of which mimics the OpenAI API. The nice thing about offering an OpenAI-compatible server is that vLLM works as a drop-in replacement for OpenAI. All the stuff that works with OpenAI simply works with vLLM. As an example, take this project here: https://github.com/jxnl/instructor. (I’m not affiliated, nor do I recommend or know much about this project; it’s just an example)

If you start diverging from the OpenAI API, this can break.

At the same time, if users want to use features the OpenAI API doesn’t offer, they are free to use the vLLM-own server implementation, where the vLLM maintainers and community are free to go beyond what OpenAI offers.

While the OpenAI API doesn’t offer an explicit json schema option, the tools functionality offers a way to do exactly that.

So yes, I don’t see a good argument for adding these parameters to the OpenAI server, because:

  1. You can use the JSON schema functionality through the tools parameter.
  2. The risk of breaking compatibility gets higher, and the cost of maintaining this server implementation increases.
  3. You are free to use the vLLM-own server if you want access to JSON schema and regex-guided decoding.

@ibeltagy

@br3no, makes sense

@felixzhu555
Contributor Author

@br3no @ibeltagy the tools parameter is being added through #2488; this PR will only add guided decoding features.

Collaborator

@simon-mo simon-mo left a comment


A biiiig problem here is that we are creating a new logits processor per request. A common scenario will be handling repeated schemas/constraints. Can you think of a good (and semantically correct) way to cache the logits processors or the FSM in outlines?
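
For illustration, one possible shape for such a cache (a minimal sketch; build_json_logits_processor is a hypothetical stand-in for whatever compiles a schema into an outlines FSM):

from functools import lru_cache

def build_json_logits_processor(schema_str: str):
    ...  # expensive: compile the JSON schema into a regex/FSM

@lru_cache(maxsize=128)
def cached_logits_processor(schema_str: str):
    # Key the cache on the canonical schema string so repeated requests
    # reuse the compiled FSM; the shared object must not carry
    # per-sequence decoding state for this to be safe.
    return build_json_logits_processor(schema_str)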

Review comments (since resolved) were left on:
requirements.txt
tests/entrypoints/test_openai_server_guided_decoding.py
vllm/entrypoints/api_server.py
vllm/entrypoints/openai/serving_chat.py
vllm/entrypoints/openai/serving_completion.py
vllm/model_executor/guided_decoding.py
@simon-mo
Collaborator

simon-mo commented Feb 14, 2024 via email

@simon-mo simon-mo mentioned this pull request Feb 28, 2024
@simon-mo simon-mo enabled auto-merge (squash) February 29, 2024 06:00
@simon-mo simon-mo merged commit 703e42e into vllm-project:main Feb 29, 2024
22 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
@arshadshk

Where can I find documentation for using guided decoding?

@felixzhu555
Contributor Author

Hi @arshadshk, sorry, we don't have written documentation for guided decoding just yet; I'll try to add that soon. If you have a specific use case I can try to explain how to use it; otherwise you can check out the guided decoding tests for some examples.

@felixzhu555 felixzhu555 deleted the add_structured_gen branch March 12, 2024 19:55
@BedirT

BedirT commented Mar 20, 2024

Is this strictly an OpenAI-compatible server feature? I don't see any mention of it being available as part of plain vLLM generation calls. If so, what is the reason for that?

@simon-mo
Collaborator

Currently yes. But it would be valuable to have a similar API in the LLM class. I will open a new issue to get folks working on it.

@simon-mo
Collaborator

@BedirT issue created #3536. I will have someone working on it.

@ProVega

ProVega commented Apr 10, 2024

Where can I find documentation for this feature?

@BedirT

BedirT commented Apr 10, 2024

I don't think there is lengthy documentation on it, but it is briefly mentioned here: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-completions-api

@ProVega

ProVega commented Apr 10, 2024

That page states the parameters, but doesn't provide any actual samples of a JSON schema. I have found a few others, but I can't find anything that shows how to return an ARRAY of JSON objects. Example: "Find me 5 companies that sell cars" and I get back { "Results": [ { "name": "Ford" }, { "name": "Toyota" }, { "name": "BMW" } ] }

@AaronFriel

AaronFriel commented Apr 10, 2024

@ProVega Tools like pydantic or zod-to-json-schema are helpful here. I think the schema you want, for just an array of names, is:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": { "type": "string" }
    }
  }
}

Or with a top-level Results key:

{
  "type": "object",
  "properties": {
    "Results": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" }
        }
      }
    }
  }
}
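
As a side note, schemas like these can be generated rather than hand-written, e.g. with pydantic v2 (a sketch; model_json_schema emits an equivalent schema, albeit with $defs/references instead of the inline form above):

from pydantic import BaseModel

class Company(BaseModel):
    name: str

class Results(BaseModel):
    Results: list[Company]

# a JSON schema dict suitable for passing as guided_json
print(Results.model_json_schema())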

@sam1am

sam1am commented Apr 30, 2024

This took me some tinkering to figure out, given all the different methods and tools discussed around this issue. Here is a simple example that consistently produces guided JSON output for me:

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder; vLLM ignores the key unless --api-key is set
)

json_schema = {
    "type": "object",
    "properties": {
        "thought": {"type": "string"},
        "answer": {"type": "string"}
    },
    "required": ["thought", "answer"]
}

query = "What is the capital of France?"

# Ask for JSON in the prompt; guided_json below enforces the schema
system_prompt = "Respond only with a json object containing the following fields and nothing else: thought, answer."
completion = client.chat.completions.create(
    model="casperhansen/llama-3-8b-instruct-awq",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ],
    extra_body={
        "stop_token_ids": [128009], 
        "response_format": {"type": "json_object"},
        "guided_json": json_schema
    }
)

print(completion.choices[0].message.content.strip())

Output:

{
  "answer": "Paris",
  "thought": "Bonjour!"
}

The only issue is the key order, which comes out alphabetical regardless of how I order the fields in the code. That matters for tree of thought-type prompting, where the model should produce its reasoning before the answer.
