[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` #4109

br3no · 2024-04-16T08:06:04Z

This PR updates the Outlines Integration from FSM to the new Guide interface.

Since I'm not sure where to place the change, I added both labels [Frontend] and [Core].

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code need to be well-documented to ensure future contributors can easily understand the code.
Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

simon-mo

LGTM.

requirements-common.txt

vllm/model_executor/guided_decoding/outlines_logits_processors.py

Co-authored-by: Simon Mo <simon.mo@hey.com>

simon-mo · 2024-04-18T07:50:10Z

I think the test failure in the entrypoints test might be related, and there's merge conflict. 🙏

Co-authored-by: Simon Mo <simon.mo@hey.com>

…5-outlines-guides

br3no · 2024-04-18T12:29:16Z

I'll have a look at the failing frontend test.

br3no · 2024-04-19T19:59:57Z

I can reproduce the error in the test on my dev environment. The generation does not stop when it should, generating IP addresses like this: 100.101.102.10319216. I'm investigating why this happens and reached out to @rlouf in the outlines discord.

I have pushed some small improvements to the test code:

~~corrected the regex for IPv4~~ the regex was correct. I reverted the change
changed the pytest fixture to make sure the same python interpreter is used when starting the vLLM server for the integration test (see the warning box here: https://docs.python.org/3/library/subprocess.html#subprocess.Popen)

vllm/model_executor/guided_decoding/outlines_logits_processors.py

navster888 · 2024-04-29T17:34:10Z

hey @br3no @simon-mo any updates on getting this PR merged? The pinned version of outlines is preventing us from picking up a bug fix in included in 0.0.40. Is there anyway we can pull the relaxed version constraint into it's own PR to unblock?

br3no · 2024-04-30T17:53:39Z

I'm having a call with outlines contributors on Thursday. While there is no guarantee we will have a solution for the problem, I'd propose to wait until then. If there's no progress by the end of the week, I'd open a separate PR for the unpinning.

What do you think?

br3no · 2024-05-02T12:58:53Z

I have opened #4558 because moving to the Guide API will require outlines-dev/outlines#856 to be fixed first.

br3no · 2024-05-10T15:45:12Z

I have closed #4558 in favor of this PR. I expect to make progress on this next week. Waiting for outlines-dev/outlines#874.

rlouf · 2024-05-11T20:18:32Z

outlines-dev/outlines#874 merged, thank you for your patience!

br3no · 2024-05-12T19:43:20Z

Great, thanks for the support @rlouf!

Can you tell already when you plan to release?

rlouf · 2024-05-19T11:52:14Z

Next week I think. I need to make sure other downstream libraries have pinned the outlines version to avoid surprises.

simon-mo · 2024-05-30T22:18:40Z

Same PR works since this is small enough. also cc @njhill I think you mentioned similar issue

njhill · 2024-06-03T23:41:36Z

Sorry, missed @simon-mo tagging me above. Yes we encountered the same problem and had been thinking to introduce a LogitsProcessorFactory abstract class that can be included in the SamplingParams instead of LogitsProcessors (but allowing those too for backwards compatibility. This could have both create_processor() and return_processor(lp) methods. Latter not required but could be used for pooling.

vLLM would then ensure to call this separately for each sequence. Stateless LP factories can just return a constant LP from the method.

Note this is currently also problem if a list of prompts is passed in the API, and/or if n > 1 like you said (including beam search).

WDYT?

@maxdebayser has started prototyping this.

br3no · 2024-06-04T07:56:18Z

@njhill I like the direction of your proposal. This would allow us to invert control and get rid of the 10 lines setting up guided decoding in the create_chat_completion method. Is there a PR with @maxdebayser's sketch?

While I think this is the right thing to do on the long run, I believe we should fix this problem ASAP. Would you mind having a look at the code in this PR which handles this issue by pushing the cache one level down? I believe this would work nicely as a band-aid until the refactoring you proposed is implemented.

The core changes are:

we are no longer caching the LogitProcessors. They are now created on each request anew.
the LogitsProcessors no longer reset state – so all n sequences can safely share the same state cache in the lifetime of one request.
the Guide object is cached globally for every guide/tokenizer pair. This is the expensive thing we don't want to recompute on every request. (this is how we were caching the LogitProcessors)

br3no · 2024-06-04T08:01:02Z

PS: this PR is ready for review @simon-mo

I'm just waiting for outlines to be released, so that we can get rid of the regression in the tests.

maxdebayser · 2024-06-04T15:55:19Z

Hi @br3no, we also found a problem with the FSM state being shared between sequences. This curl here causes a crash:

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "<MY_MODEL>",
    "prompt": ["An example of a json document: ", "Another example of a json document: "],
    "max_tokens": 100,
    "temperature": 0,
    "guided_decoding_backend": "outlines",
    "response_format": {"type":"json_object"},
    "logit_bias": {"100": -100}
  }'

We have a sketch for a PR here: IBM/vllm#38 . It uses factories like @njhill mentioned, so that each sequence can have it's own logits processor copy.

The changes in our PR solve this particular issue, but I think the CFGLogitsProcessor would still crash if the sequence is preempted with the recompute policy. But I don't know yet how to test this hypothesis.

maxdebayser · 2024-06-04T16:23:16Z

@br3no , I've tested your changes with the curl command above. The code doesn't crash anymore, but I think the output ends prematurely:

{
  "id": "cmpl-d1e40aa8af734fcc98238fa5e0c2ecac",
  "object": "text_completion",
  "created": 1717517730,
  "choices": [
    {
      "index": 0,
      "text": "\n\n\n{\n",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    },
    {
      "index": 1,
      "text": "\n\n\n{",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "total_tokens": 29,
    "completion_tokens": 11
  }
}

So while I do think that your changes are an improvement, I think we need to add the factory PR as well to deal with stateful LogitsProcessors properly. Just for reference, the curl request returns this on our PR:

{
  "id": "cmpl-861186441f9c459c981fcab5abd33f5b",
  "object": "text_completion",
  "created": 1717515348,
  "choices": [
    {
      "index": 0,
      "text": "\n\n\n{\n\"name\" \n: \n\"John Doe\"\n,\n\"age\" \n: 30\n,\n\"city\" \n: \n\"New York\"\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    },
    {
      "index": 1,
      "text": "\n\n\n{\n\"name\" \n: \n\"John Doe\"\n,\n\"age\" \n: 30\n,\n\"city\" \n: \n\"New York\"\n}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "total_tokens": 218,
    "completion_tokens": 200
  }
}

maxdebayser

LGTM

maxdebayser

Actually, if I run

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "<MY_MODEL>",
    "prompt": ["An example of a json document: ", "Another example of a json document: "],
    "max_tokens": 100,
    "temperature": 0,
    "guided_decoding_backend": "outlines",
    "response_format": {"type":"json_object"},
    "logit_bias": {"100": -100}
  }'

followed by

curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "<MY_MODEL>",
    "prompt": ["An example of a json document: "],                                        
    "max_tokens": 100,
    "temperature": 0,
    "guided_decoding_backend": "outlines",
    "response_format": {"type":"json_object"},
    "logit_bias": {"100": -100}
  }' | jq

I get this crash:

    File "/home/develop/.local/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 47, in __call__
    instruction = self._guide.get_next_instruction(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/develop/.local/lib/python3.11/site-packages/outlines/fsm/guide.py", line 349, in get_next_instruction
    interactive.exhaust_lexer()
  File "/opt/vllm/lib/python3.11/site-packages/lark/parsers/lalr_interactive_parser.py", line 52, in exhaust_lexer
    return list(self.iter_parse())
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib/python3.11/site-packages/lark/parsers/lalr_interactive_parser.py", line 43, in iter_parse
    for token in self.lexer_thread.lex(self.parser_state):
  File "/opt/vllm/lib/python3.11/site-packages/lark/lexer.py", line 674, in lex
    raise UnexpectedToken(token, e.allowed, state=parser_state, token_history=[last_token], terminals_by_name=self.root_lexer.terminals_by_name)
lark.exceptions.UnexpectedToken: Unexpected token Token('LBRACE', '{') at line 7, column 2.
Expected one of: 
	* UNESCAPED_STRING
	* RBRACE
Previous tokens: [Token('LBRACE', '{')]

If I restart the server and just send the curl request with a single prompt several times, only the first request generates useful JSON. After that it returns:

{
  "id": "cmpl-948cb1c4033b40f1a07846bed5ac4de9",
  "object": "text_completion",
  "created": 1717518689,
  "model": "/llama_eval_storage/LLaMa/models/hf/7B-F",
  "choices": [
    {
      "index": 0,
      "text": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 109,
    "completion_tokens": 100
  }
}

I'm testing with LLama2 7b and outlines==0.0.41

br3no · 2024-06-04T16:43:48Z

@maxdebayser thanks for looking into it!

we also found a problem with the FSM state being shared between sequences

I have just looked into Outlines and while RegexGuide is thread-safe, CFGGuide is not.

I'll check if I can increase the size of the band-aid a bit to deal with this case...

br3no · 2024-06-05T09:04:38Z

I have pushed a commit that comes close to what was there before and at the same time does not lead to crashes on n > 1. It's still not correct, though...

Every CFGGuide can only be used by one sequence at a time. We need the factory idea in IBM/vllm#38. Note that the performance will still be horrible, since we will create a new Guide for every sequence. This is unavoidable, unfortunately.

It probably makes sense to pre-build a (large) pool of CFGGuides for the use-case where request.response_format.type == "json_object". I believe this is a pragmatic and useful solution, since this will be the most common case.

To really solve this, Outlines would need to be changed to make the CFGGuide thread-safe. I believe this would require a large effort.

@rlouf, could you give us your expert opinion on this?

maxdebayser · 2024-06-05T17:20:15Z

@br3no , I can confirm that your latest commit fixes the problem where state from previous single-sequence requests is carried over to new requests.

simon-mo

Thank you for doing this. Please let me know when this PR is ready to be merged!

simon-mo · 2024-06-05T23:49:09Z

Actually it seems complete given @maxdebayser's comment. I will merge now.

…llm-project#4109) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>

Update Outlines Integration from FSM to Guide vllm-project#3715

05888b6

simon-mo self-assigned this Apr 17, 2024

simon-mo mentioned this pull request Apr 18, 2024

[Feature]: No outlines strong dependency #4153

Open

simon-mo approved these changes Apr 18, 2024

View reviewed changes

requirements-common.txt Outdated Show resolved Hide resolved

vllm/model_executor/guided_decoding/outlines_logits_processors.py Outdated Show resolved Hide resolved

Update vllm/model_executor/guided_decoding/outlines_logits_processors.py

523a245

Co-authored-by: Simon Mo <simon.mo@hey.com>

br3no and others added 3 commits April 18, 2024 11:17

Unpin outlines

b30fad8

Co-authored-by: Simon Mo <simon.mo@hey.com>

Merge remote-tracking branch 'upstream/main' into 3715-outlines-guides

1c8113a

Merge branch '3715-outlines-guides' of github.com:br3no/vllm into 371…

04d9f8d

…5-outlines-guides

br3no added 2 commits April 19, 2024 21:51

small improvements to test code

636d9a3

fixed formatting...

adc4b02

reverting change in regex

1bf1556

This was referenced Apr 21, 2024

Fix/283 ray timeout ScandEval/ScandEval#410

Merged

[BUG] Outlines version clash with vLLM ScandEval/ScandEval#414

Closed

mgoin mentioned this pull request Apr 24, 2024

[Misc] Upgrade outlines to v0.0.41 #4330

Closed

njhill reviewed Apr 24, 2024

View reviewed changes

vllm/model_executor/guided_decoding/outlines_logits_processors.py Show resolved Hide resolved

adding explicit error when the type is not Generate or Write

e6ffb1a

This was referenced May 2, 2024

Endless generation bug popped up during migration to Guide in vLLM integration outlines-dev/outlines#856

Closed

[CI/Build] Unpin outlines #4558

Closed

This was referenced May 25, 2024

[Bugfix] Adds outlines performance improvement #5006

Open

[Bugfix] Adds outlines performance improvement #5053

Draft

Breno Faria added 4 commits May 31, 2024 11:15

Merge branch 'main' into 3715-outlines-guides

7d76997

push cache down

02239aa

removing unused imports

48877d4

fix test

dd5b67b

dongxiaolong mentioned this pull request Jun 3, 2024

v0.4.3 Release Tracker #4895

Closed

6 tasks

simon-mo mentioned this pull request Jun 3, 2024

v0.5.0 Release Tracker #5224

Closed

2 tasks

Merge branch 'vllm-project:main' into 3715-outlines-guides

1361c8a

maxdebayser approved these changes Jun 4, 2024

View reviewed changes

maxdebayser suggested changes Jun 4, 2024

View reviewed changes

increase size of band-aid and move to outlines 0.0.43

20ede84

simon-mo approved these changes Jun 5, 2024

View reviewed changes

simon-mo merged commit 7b0a0df into vllm-project:main Jun 5, 2024
101 of 103 checks passed

blinkbear pushed a commit to blinkbear/vllm that referenced this pull request Jun 6, 2024

[Frontend][Core] Update Outlines Integration from FSM to Guide (v…

ef0e4b7

…llm-project#4109) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>

maxdebayser mentioned this pull request Jun 7, 2024

[Core] Fix sharing of stateful logits processors #5329

Open

maxdebayser mentioned this pull request Jun 13, 2024

[Bug]: sending request using response_format json twice breaks vLLM #4070

Open

joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024

[Frontend][Core] Update Outlines Integration from FSM to Guide (v…

e22d5ed

…llm-project#4109) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024

[Frontend][Core] Update Outlines Integration from FSM to Guide (v…

29c6f8a

…llm-project#4109) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024

[Frontend][Core] Update Outlines Integration from FSM to Guide (v…

98072da

…llm-project#4109) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` #4109

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` #4109

br3no commented Apr 16, 2024

simon-mo left a comment

simon-mo commented Apr 18, 2024

br3no commented Apr 18, 2024

br3no commented Apr 19, 2024 •

edited

Loading

navster888 commented Apr 29, 2024

br3no commented Apr 30, 2024

br3no commented May 2, 2024

br3no commented May 10, 2024

rlouf commented May 11, 2024

br3no commented May 12, 2024

rlouf commented May 19, 2024

simon-mo commented May 30, 2024

njhill commented Jun 3, 2024

br3no commented Jun 4, 2024

br3no commented Jun 4, 2024

maxdebayser commented Jun 4, 2024 •

edited

Loading

maxdebayser commented Jun 4, 2024

maxdebayser left a comment

maxdebayser left a comment

br3no commented Jun 4, 2024

br3no commented Jun 5, 2024

maxdebayser commented Jun 5, 2024

simon-mo left a comment

simon-mo commented Jun 5, 2024

[Frontend][Core] Update Outlines Integration from FSM to Guide #4109

[Frontend][Core] Update Outlines Integration from FSM to Guide #4109

Conversation

br3no commented Apr 16, 2024

PR Title and Classification

Code Quality

Notes for Large Changes

What to Expect for the Reviews

Thank You

simon-mo left a comment

Choose a reason for hiding this comment

simon-mo commented Apr 18, 2024

br3no commented Apr 18, 2024

br3no commented Apr 19, 2024 • edited Loading

navster888 commented Apr 29, 2024

br3no commented Apr 30, 2024

br3no commented May 2, 2024

br3no commented May 10, 2024

rlouf commented May 11, 2024

br3no commented May 12, 2024

rlouf commented May 19, 2024

simon-mo commented May 30, 2024

njhill commented Jun 3, 2024

br3no commented Jun 4, 2024

br3no commented Jun 4, 2024

maxdebayser commented Jun 4, 2024 • edited Loading

maxdebayser commented Jun 4, 2024

maxdebayser left a comment

Choose a reason for hiding this comment

maxdebayser left a comment

Choose a reason for hiding this comment

br3no commented Jun 4, 2024

br3no commented Jun 5, 2024

maxdebayser commented Jun 5, 2024

simon-mo left a comment

Choose a reason for hiding this comment

simon-mo commented Jun 5, 2024

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` #4109

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` #4109

br3no commented Apr 19, 2024 •

edited

Loading

maxdebayser commented Jun 4, 2024 •

edited

Loading