[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization #3567

Open
jens-create opened this issue Mar 22, 2024 · 39 comments · May be fixed by #5006

@jens-create

Anything you want to discuss about vllm.

Hi,

I am running some benchmarks against vllm.entrypoints.openai.api_server, measuring latency and throughput with different numbers of concurrent requests.

Specs:

  • H100 80GB
  • qwen-1.5-14B-chat

I am sending 1000 requests with random prompts of token length 512. These are the results I get (see attached image):

Guided_json

  • ~100 running requests
  • ~70 generation tokens per second
  • ~1700 ms median token time

Non-guided_json

  • ~100 running requests
  • ~800 generation tokens per second
  • ~75 ms median token time (TPOT)

At 10 concurrent requests (GPU utilization << 100%):

Non-guided_json: ~20 ms median token time
guided_json: ~ 160 ms median token time

The application I am building currently relies heavily on guided_json; however, before putting it in an online setting I would like to ask: 1) are the numbers I am seeing sensible, and 2) what can be done to improve performance in the guided_json paradigm?

I am debating whether I should try to prompt my way to structured outputs and thus avoid constrained decoding.

[Attached screenshot (2024-03-22): benchmark metrics for the run described above]
@simon-mo
Collaborator

Is the JSON schema complex at all, and is it the same each time? The 70 toks/s number is a bit lower than I expected. This can be due to several factors if it's the same schema:

  • Currently the logits mask computation is performed on the critical path but it can be moved earlier.
  • We currently don't batch the application of logits processors.
  • Python overhead in general

I'm interested in fixing the performance here.
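For readers following along, the per-step work being discussed is roughly the following (a simplified sketch of a guided-decoding logits processor, not vLLM's actual code; the fsm object and its allowed_token_ids method stand in for the backend's FSM interface):

import math
import torch

def guided_step(scores: torch.Tensor, fsm, state: int) -> torch.Tensor:
    """One decode step: mask the logits so only schema-legal tokens can be sampled."""
    allowed = fsm.allowed_token_ids(state)   # Python list of token ids from the FSM
    mask = torch.full_like(scores, -math.inf)
    mask[allowed] = 0                        # host list -> device indexing, on the critical path
    return scores + mask

This runs once per sequence per generated token, which is why it shows up directly in per-token latency.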

@simon-mo simon-mo self-assigned this Mar 22, 2024
@jens-create
Author

Hi Simon,

The JSON schema is the same at all times, and it is as follows:

"guided_json": {"$defs": {"SearchQuery": {"description": "Search query for the retrieval task.", "properties": {"query_type": {"description": "The type of query most effective for handling the retrieval task.", "title": "Query Type", "type": "string"}, "query": {"description": "A random user's search query.", "title": "Query", "type": "string"}}, "required": ["query_type", "query"], "title": "SearchQuery", "type": "object"}}, "description": "A list of search queries anticipating a user looking for information from a given web page.", "properties": {"queries": {"description": "Brainstormed search queries for the given web page.", "items": {"$ref": "#/$defs/SearchQuery"}, "title": "Queries", "type": "array"}}, "required": ["queries"], "title": "Brainstorm", "type": "object"}

Thanks for looking into this 🫶
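For reference, this schema matches what Pydantic would emit for models along these lines (a reconstruction, so the exact source may differ):

from typing import List
from pydantic import BaseModel, Field

class SearchQuery(BaseModel):
    """Search query for the retrieval task."""
    query_type: str = Field(description="The type of query most effective for handling the retrieval task.")
    query: str = Field(description="A random user's search query.")

class Brainstorm(BaseModel):
    """A list of search queries anticipating a user looking for information from a given web page."""
    queries: List[SearchQuery] = Field(description="Brainstormed search queries for the given web page.")

# Brainstorm.model_json_schema() reproduces the guided_json payload above (modulo key order)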

@jens-create
Author

@simon-mo any update on this? 😊

@taoisu

taoisu commented Apr 24, 2024

Facing a similar issue here: I have a JSON schema with 14 fields, and the request gets stuck forever.

@lithafnium

lithafnium commented May 7, 2024

My schema only has 2 fields and also has significant latency issues compared to running without guided_json. Would love to have this fixed, as model performance severely decreases without it.

@simon-mo
Collaborator

simon-mo commented May 7, 2024

I would suggest trying out setting --guided-decoding-backend lm-format-enforcer (through args) or "guided_decoding_backend": "lm-format-enforcer" as part of the request to see whether it helps. See the original PR here: #3868 (cc @noamgat)
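For example, a request opting into the alternative backend might look like this (a sketch against the OpenAI-compatible server; the endpoint, model name, and schema here are placeholders):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen1.5-14B-Chat",   # placeholder: whatever model the server is running
        "messages": [{"role": "user", "content": "Give me one search query as JSON."}],
        "guided_json": {"type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"]},
        "guided_decoding_backend": "lm-format-enforcer",   # per-request backend override
    },
)
print(resp.json()["choices"][0]["message"]["content"])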

@noamgat
Contributor

noamgat commented May 7, 2024

If testing lm-format-enforcer, I highly recommend adding the latest version of it to the image, as there have been performance improvements to the JsonSchemaParser. The next version of vLLM will include them, but until then, do pip install lm-format-enforcer==0.10.1 in the image before testing.

@jarrelscy

What speeds are you getting @noamgat vs. the outlines backend?

@noamgat
Contributor

noamgat commented May 7, 2024

I didn't test on A100/H100s, but on my dev setup (RTX 3090, Mistral 7B) I was seeing less than a 2x reduction in tokens/s for simple schemas.

@nullpointer0xffff

+1, it doesn't seem to be GPU related; I tested with A100 and V100 GPUs and both show a similar issue.

Using line_profiler, I found that this get_guided_decoding_logits_processor call takes 93% of the time.
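For anyone who wants to reproduce that kind of measurement, a sketch of the line_profiler setup (the vLLM import path may differ between versions, and run_one_guided_request is a stand-in for your own test function):

from line_profiler import LineProfiler
from vllm.model_executor.guided_decoding import get_guided_decoding_logits_processor  # path may vary by version

lp = LineProfiler()
lp.add_function(get_guided_decoding_logits_processor)

profiled = lp(run_one_guided_request)   # run_one_guided_request: your own function issuing a guided request
profiled()
lp.print_stats()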

@nullpointer0xffff

If testing lm-format-enforcer, I highly recommend adding the latest version of it to the image, as there have been performance improvements to the JsonSchemaParser. The next version of vLLM will include them, but until them, do pip install lm-format-enforcer==0.10.1 in the image before testing.

This gets flagged because vllm 0.4.1+cu118 requires lm-format-enforcer==0.9.8, so installing 0.10.1 requires adding --no-deps.

Just tested: the speed-up is not obvious (25 tok/s -> 32 tok/s on V100); probably the main bottleneck is still get_guided_decoding_logits_processor.

@nullpointer0xffff

@noamgat here's a profiling run when I use lm-format-enforcer 0.10.1.

/lib/python3.10/site-packages/lmformatenforcer/integrations/transformers.py

Function: _build_regular_tokens_list at line 58

 

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    58                                           @profile
    59                                           def _build_regular_tokens_list(tokenizer: PreTrainedTokenizerBase) -> List[Tuple[int, str, bool]]:
    60         1  912794903.0    9e+08      9.5      token_0 = tokenizer.encode("0")[-1]
    61         1       8025.0   8025.0      0.0      regular_tokens = []
    62    128257   28050361.0    218.7      0.3      for token_idx in range(len(tokenizer)):
    63    128256   78294452.0    610.5      0.8          if token_idx in tokenizer.all_special_ids:
    64         2        450.0    225.0      0.0              continue
    65                                                   # We prepend token 0 and skip the first letter of the result to get a space if the token is a start word.
    66    128254 5319568501.0  41476.8     55.3          decoded_after_0 = tokenizer.decode([token_0, token_idx])[1:]
    67    128254 3162992335.0  24661.9     32.9          decoded_regular = tokenizer.decode([token_idx])
    68    128254   56427009.0    440.0      0.6          is_word_start_token = len(decoded_after_0) > len(decoded_regular)
    69    128254   61975079.0    483.2      0.6          regular_tokens.append((token_idx, decoded_after_0, is_word_start_token))
    70         1        240.0    240.0      0.0      return regular_tokens

The two decode calls in the for loop seem to take most of the time. Happy to run further tests if needed.

@noamgat
Contributor

noamgat commented May 7, 2024 via email

@noamgat
Contributor

noamgat commented May 8, 2024

Just clarifying - if possible, start the tokens/s measuring and/or profiling from the second request onwards. While the warm-up time is also something that can be optimized, the post-warmup performance matters much more for real-world use cases. This is true for all guided decoding backends.
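A minimal sketch of what that looks like in practice (the endpoint, model, and schema are placeholders; the point is only that request 0 is excluded from the measurement):

import statistics
import time

import requests

payload = {
    "model": "Qwen/Qwen1.5-14B-Chat",   # placeholder: whatever model the server is running
    "messages": [{"role": "user", "content": "Return a JSON object."}],
    "max_tokens": 128,
    "guided_json": {"type": "object",
                    "properties": {"test": {"type": "string"}},
                    "required": ["test"]},
}

latencies = []
for i in range(11):
    t0 = time.perf_counter()
    requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
    if i == 0:
        continue            # discard request 0: FSM compilation / tokenizer prep happens here
    latencies.append(time.perf_counter() - t0)

print(f"post-warmup median latency: {statistics.median(latencies):.2f}s")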

@Qubitium
Contributor

Qubitium commented May 8, 2024

@nullpointer0xffff @jens-create I just confirmed that the caching of the LMFE tokenizer init (which is very slow) via @lru_cache is working, so _build_regular_tokens_list should never be called after the first request.
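The caching pattern in question, in miniature (a sketch of the idea, not the actual vLLM/LMFE code; _build_regular_tokens_list is the slow function profiled above):

from functools import lru_cache

from lmformatenforcer.integrations.transformers import _build_regular_tokens_list
from transformers import AutoTokenizer

@lru_cache(maxsize=8)                 # keyed on the model name string
def cached_regular_tokens(model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return _build_regular_tokens_list(tokenizer)

# the first call pays the multi-second build cost; subsequent calls are dict lookups
tokens = cached_regular_tokens("Qwen/Qwen1.5-14B-Chat")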

@SaloniGandhi

SaloniGandhi commented May 15, 2024

Maybe we can modify the __call__ method to separate the mask computation from the logits adjustment. This would allow the mask to be computed once and reused. Let me know if this makes sense @simon-mo
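One way that idea could look (a sketch, not a proposed vLLM patch: keep a CPU-side mask per FSM state and only move/add it on the critical path; fsm is assumed to expose allowed_token_ids(state)):

import math
import torch

class MaskCachingProcessor:
    def __init__(self, fsm, vocab_size: int):
        self.fsm = fsm                                   # object exposing allowed_token_ids(state)
        self.vocab_size = vocab_size
        self._masks: dict[int, torch.Tensor] = {}

    def mask_for(self, state: int) -> torch.Tensor:
        mask = self._masks.get(state)
        if mask is None:
            mask = torch.full((self.vocab_size,), -math.inf)
            mask[self.fsm.allowed_token_ids(state)] = 0  # paid once per distinct state
            self._masks[state] = mask
        return mask

    def __call__(self, state: int, scores: torch.Tensor) -> torch.Tensor:
        return scores.add_(self.mask_for(state).to(scores.device))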

@lynkz-matt-psaltis

lynkz-matt-psaltis commented May 22, 2024

Just sharing my experience with this issue - it seems to align with the OP's.

Summary: CPU-constrained guidance means that batching can't scale correctly.

vLLM: 0.4.2
Outlines: 0.0.34
lm_format_enforcer: 0.10.2
Model: Llama 3 8b instruct
Hardware:

  • Single A100 (80G)
  • AMD EPYC 7V13 (24 cores)

Single request:

Outlines: ~70 tps - CPU 100%
lm_format_enforcer: ~45 tps - CPU 100%
No guidance: ~140 tps

Batched requests:

Outlines: ~70 tps - CPU 100%
lm_format_enforcer: ~45 tps - CPU 100%
No guidance: ~1000 tps

Guided regex and JSON are both affected.

Example guidance:

regex
~~~response\n# Content\\n([.\\W\\w]+)\\n{2}~{3}
json
{"type":"object","properties":{"test":{"type":"string"}},"required":["test"]}

@lynkz-matt-psaltis

Here's line timings for model_executor/guided_decoding/outlines_logits_processors.py

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    41                                               @line_profiler.profile
    42                                               def __call__(self, input_ids: List[int],
    43                                                            scores: torch.Tensor) -> torch.Tensor:
    44                                                   """Use the FSM to bias the logits before sampling the next token."""
    45      2686      22898.6      8.5      0.1          seq_id = hash(tuple(input_ids))
    46
    47      2686       1994.5      0.7      0.0          if len(input_ids) == 0:
    48         3         13.6      4.5      0.0              self.init_state()
    49                                                   else:
    50      2683        953.2      0.4      0.0              last_token = input_ids[-1]
    51      2683      12007.7      4.5      0.0              last_seq_id = hash(tuple(input_ids[:-1]))
    52      5366      15540.9      2.9      0.0              self.fsm_state[seq_id] = self.fsm.next_state(
    53      2683       2226.4      0.8      0.0                  self.fsm_state[last_seq_id], last_token)
    54
    55      2686    2022417.3    752.9      5.2          allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
    56
    57      5372      83383.0     15.5      0.2          mask = torch.full((scores.shape[-1], ),
    58      2686       3307.1      1.2      0.0                            -math.inf,
    59      2686       1901.2      0.7      0.0                            device=scores.device)
    60      2686   36566141.1  13613.6     94.3          mask[allowed_tokens] = 0
    61      2686      36379.1     13.5      0.1          scores.add_(mask)
    62      2686        794.1      0.3      0.0          return scores

@felixzhu555
Contributor

Based on that timing breakdown, can you try to replace mask[allowed_tokens] = 0 by using torch index_fill? e.g. mask.index_fill_(0, allowed_tokens, 0)
This might be faster than manually indexing the mask tensor.
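A quick way to sanity-check that the two forms are equivalent before benchmarking them (a standalone snippet with synthetic data):

import math
import torch

vocab = 128_256
allowed = torch.randperm(vocab)[: vocab // 2]        # stand-in for fsm.allowed_token_ids(...)

# current approach: advanced indexing with an index tensor
mask_a = torch.full((vocab,), -math.inf)
mask_a[allowed] = 0

# suggested approach: index_fill_ (the index tensor must be int64)
mask_b = torch.full((vocab,), -math.inf)
mask_b.index_fill_(0, allowed.to(torch.int64), 0)

assert torch.equal(mask_a, mask_b)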

@lynkz-matt-psaltis

I've been doing some further perf analysis and breaking things out a bit to try to understand the bottleneck. It doesn't seem to be related to the indexing but rather to moving the allowed_tokens array around.

CPU first, then move to GPU:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    43                                               @line_profiler.profile
    44                                               def __call__(self, input_ids: List[int],
    45                                                            scores: torch.Tensor) -> torch.Tensor:
    46                                                   """Use the FSM to bias the logits before sampling the next token."""
    47      2529      18368.3      7.3      0.1          seq_id = hash(tuple(input_ids))
    48
    49      2529       2418.0      1.0      0.0          if len(input_ids) == 0:
    50         3         22.6      7.5      0.0              self.init_state()
    51                                                   else:
    52      2526        886.1      0.4      0.0              last_token = input_ids[-1]
    53      2526      11457.4      4.5      0.1              last_seq_id = hash(tuple(input_ids[:-1]))
    54      5052      14539.8      2.9      0.1              self.fsm_state[seq_id] = self.fsm.next_state(
    55      2526       1931.4      0.8      0.0                  self.fsm_state[last_seq_id], last_token)
    56
    57      2529    1903376.3    752.6     10.5          allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
    58      2529   15524244.9   6138.5     85.5          allowed_tokens_tensor = torch.tensor(allowed_tokens, dtype=torch.int32, device='cpu')
    59
    60      2529       3262.6      1.3      0.0          if self.mask is None or self.allowed_tokens_tensor is None:
    61      2529      82721.9     32.7      0.5              self.mask = torch.full_like(scores, -math.inf)
    62                                                   else:
    63                                                       self.mask.fill_(-math.inf)
    64                                                   
    65      2529       4009.6      1.6      0.0          if (allowed_tokens_tensor.device != scores.device):
    66      2529     489064.9    193.4      2.7              allowed_tokens_tensor = allowed_tokens_tensor.to(scores.device)
    67                                                       
    68      2529      39004.4     15.4      0.2          allowed_tokens_tensor = allowed_tokens_tensor.to(torch.int64)
    69                                                   
    70      2529      35650.6     14.1      0.2          self.mask.index_fill_(0, allowed_tokens_tensor, 0)
    71      2529      23729.9      9.4      0.1          scores.add_(self.mask)
    72
    73      2529        630.8      0.2      0.0          return scores

Straight to GPU:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    43                                               @line_profiler.profile
    44                                               def __call__(self, input_ids: List[int],
    45                                                            scores: torch.Tensor) -> torch.Tensor:
    46                                                   """Use the FSM to bias the logits before sampling the next token."""
    47      2252      14057.0      6.2      0.1          seq_id = hash(tuple(input_ids))
    48
    49      2252       1943.8      0.9      0.0          if len(input_ids) == 0:
    50         3         12.1      4.0      0.0              self.init_state()
    51                                                   else:
    52      2249        696.3      0.3      0.0              last_token = input_ids[-1]
    53      2249       9021.9      4.0      0.1              last_seq_id = hash(tuple(input_ids[:-1]))
    54      4498      14201.8      3.2      0.1              self.fsm_state[seq_id] = self.fsm.next_state(
    55      2249       1836.6      0.8      0.0                  self.fsm_state[last_seq_id], last_token)
    56
    57      2252    1692571.2    751.6     10.5          allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
    58      2252   14277941.7   6340.1     88.3          allowed_tokens_tensor = torch.tensor(allowed_tokens, dtype=torch.int32, device=scores.device)
    59
    60      2252       2582.4      1.1      0.0          if self.mask is None or self.allowed_tokens_tensor is None:
    61      2252      55524.5     24.7      0.3              self.mask = torch.full_like(scores, -math.inf)
    62                                                   else:
    63                                                       self.mask.fill_(-math.inf)
    64                                                   
    65      2252       3560.3      1.6      0.0          if (allowed_tokens_tensor.device != scores.device):
    66                                                       allowed_tokens_tensor = allowed_tokens_tensor.to(scores.device)
    67                                                       
    68      2252      34986.8     15.5      0.2          allowed_tokens_tensor = allowed_tokens_tensor.to(torch.int64)
    69                                                   
    70      2252      32876.8     14.6      0.2          self.mask.index_fill_(0, allowed_tokens_tensor, 0)
    71      2252      22152.8      9.8      0.1          scores.add_(self.mask)
    72
    73      2252        633.6      0.3      0.0          return scores

@lynkz-matt-psaltis

58     12693    9401753.9    740.7     17.2          allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
59     12693   42707835.5   3364.7     78.1          np_allowed_tokens = np.array(allowed_tokens, dtype=np.int32)
60     12693      73736.7      5.8      0.1          allowed_tokens_tensor = torch.from_numpy(np_allowed_tokens)

Halved the cost by using a NumPy array first; I also tried torch.as_tensor but saw no significant change.
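A rough micro-benchmark of the two conversion paths (numbers will vary by machine; this only illustrates the comparison being made):

from timeit import timeit

import numpy as np
import torch

allowed = list(range(100_000))   # a large Python list of allowed token ids

t_list  = timeit(lambda: torch.tensor(allowed, dtype=torch.int32), number=100)
t_numpy = timeit(lambda: torch.from_numpy(np.array(allowed, dtype=np.int32)), number=100)
print(f"torch.tensor(list): {t_list:.3f}s   torch.from_numpy(np.array(list)): {t_numpy:.3f}s")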

@lynkz-matt-psaltis

Beyond this, I'm not sure I see a way forward without changes to outlines and lm-format-enforcer to provide the information in a more efficient structure than a List. Does anyone see any memoisation opportunities here to at least reduce the iteration counts?

@jarrelscy

jarrelscy commented May 23, 2024

One thing I think we could do to make it faster is to use the fact that allowed_tokens is either almost all the tokens, or none of the tokens. Currently the mask is created at -math.inf, but we could also create the mask at 0 if the length of allowed_tokens is < scores.shape[0]/2 and then fill_ with -math.inf instead?
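A sketch of that idea, interpreting it as "index whichever side of the split is smaller" (an illustration, not a tested patch):

import math
import torch

def apply_guidance_mask(scores: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Bias the logits in place, indexing whichever side of the allowed/disallowed split is smaller."""
    vocab = scores.shape[-1]
    if allowed.numel() * 2 < vocab:
        # few tokens allowed: start from -inf and zero out the allowed ones
        mask = torch.full_like(scores, -math.inf)
        mask[allowed] = 0
    else:
        # most tokens allowed: start from 0 and -inf the (small) disallowed complement
        disallowed = torch.ones(vocab, dtype=torch.bool, device=scores.device)
        disallowed[allowed] = False
        mask = torch.zeros_like(scores)
        mask[disallowed] = -math.inf
        # caveat: building the boolean complement still touches the full vocab, so this
        # only helps if the List -> Tensor conversion is not the dominant cost
    return scores.add_(mask)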

@lynkz-matt-psaltis

I went down that same line of thinking, but I don't think the timings above support it. It's getting the Python List into a Tensor that seems to be 80%+ of the cost per iteration. So, short of data-structure changes upstream, my current thinking is that we're left with iteration optimisations - can we avoid going back to fsm.allowed_token_ids in certain situations? Not sure on that yet - still learning how this all fits together.

@JGSweets
Contributor

Are the PRs for this issue currently stalled due to competing priorities?

@lapp0

lapp0 commented Jul 2, 2024

I believe my PR mitigates the issue; I would appreciate some testing to verify: dottxt-ai/outlines#1013

It decreases the worst case Outlines structured generation logits processor overhead from 50ms to 1ms.

Please let me know if this doesn't resolve this thread's issue and another approach is needed.

Edit:

It's available in Outlines main

pip install git+https://github.com/outlines-dev/outlines

@robcaulk
Contributor

robcaulk commented Jul 3, 2024

I did not test the solution yet, but here is the same issue posted to Outlines.

dottxt-ai/outlines#1011 (comment)

I suspected it might be a threading issue in vLLM logits decoding, but it sounds like this thread would have picked up on that if that were the case.

@keeth

keeth commented Jul 4, 2024

Just anecdotally, I am still seeing a 20x slowdown doing batch generation with JSONLogitsProcessor enabled, even with @lapp0's PR. Let me know if I can provide more information.

@lapp0

lapp0 commented Jul 4, 2024

Thanks for reporting back

Seems we've tackled one component of slowness, as reported by @lynkz-matt-psaltis: "It's getting the Python List into a Tensor that seems to be 80%+ of the cost per iteration".

However, as suggested by @robcaulk, there may be another source of slowness in vLLM's Outlines integration involving the ThreadPoolExecutor:

global_thread_pool = concurrent.futures.ThreadPoolExecutor(

I also have an alternative hypothesis: if the issue only occurs when using vLLM + Ray, perhaps the state of these logits processors is expensive to communicate between Ray workers.

Could you please

    1. rm -rf ~/.cache/outlines just to be sure the cached RegexFSM isn't from a previous version
    2. Share the version details of relevant packages with pip freeze
    3. Provide a reproduction script and hardware details (especially the GPU type and count)
    4. If you've done any profiling, share that as well

@lapp0

lapp0 commented Jul 22, 2024

Is anyone still experiencing this issue? If so, could you please provide the call you're making (if distinct from that of someone previously posting in this thread), and hardware info so I can reproduce?

@bannsec

bannsec commented Jul 23, 2024

Did any of these changes make it into a release? Or is all of this still in PRs and such?

@robcaulk
Contributor

Is anyone still experiencing this issue? If so, could you please provide the call you're making (if distinct from that of someone previously posting in this thread), and hardware info so I can reproduce?

Hey Andrew, yes, the issue still exists - it can be reproduced using the script I furnished in the following thread:

dottxt-ai/outlines#1011

Regarding vLLM + Ray: the issue also occurs without Ray, so there is no evidence to support this theory.

However, using lm-format-enforcer we see an identical issue, and lm-format-enforcer uses the same ThreadPoolExecutor. That would suggest the problem is likely something to do with the logits processing inside vLLM.

@lynkz-matt-psaltis

Hey all,

In terms of changes as I understand them:

Outlines has moved towards tensors being passed between Outlines guides and vLLM, which is a fantastic step for this.
Merged PR: https://github.com/outlines-dev/outlines/pull/1013/files#diff-202c3676a40bf3fd70a140e8e4fa2959cb88548cf134a7f809ad50e0f6b4176d

If we look at the current outlines integration here: https://github.com/outlines-dev/outlines/blob/main/outlines/integrations/vllm.py#L112

We see the use of a caching strategy to reduce repetitive operations.
However, the vLLM equivalent here is not functionally equivalent in terms of caching: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/guided_decoding/outlines_logits_processors.py#L48

So, depending on whether you use vLLM's Outlines integration or Outlines directly, you're likely to get different behaviours.
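The kind of reuse being referred to, in miniature (an illustration of the caching idea, not the PR's actual diff): memoise the list-to-tensor conversion per FSM state so a batch of sequences sitting at the same state shares one tensor.

import torch

_allowed_cache: dict[tuple[int, int], torch.Tensor] = {}

def allowed_tokens_tensor(guide_key: int, state: int, fsm) -> torch.Tensor:
    """Memoise the Python-list -> tensor conversion per (guide, FSM state)."""
    key = (guide_key, state)
    cached = _allowed_cache.get(key)
    if cached is None:
        cached = torch.tensor(fsm.allowed_token_ids(state), dtype=torch.int64)
        _allowed_cache[key] = cached
    return cached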

Here's a draft PR taking the caching opportunity in outlines and reapplying it to vllm on top of latest. #6715

Would love feedback on whether this helps others. It may be that we need the threadpool work as well to realise the full benefit. It helps us at least with batched requests on 1x A100 / 24-core compute.

To reaffirm - the caching does not change individual request performance, it helps with batched requests.

I've definitely seen call stacks and flame graphs similar to what @robcaulk described above. Hoping to find time for further investigation around September, so if anyone beats me to it, fantastic! :D

@lapp0

lapp0 commented Jul 25, 2024

Did any of these changes make it into a release? Or is all of this still in PRs and such

It's still only in Outlines main: pip install git+https://github.com/outlines-dev/outlines

However, using lm-format-enforcer we see an identical issue, and lm-format-enforcer uses the same ThreadPoolExecutor. That would suggest the problem is likely something to do with the logits processing inside vLLM.

Thanks so much for investigating! That helps a lot.

Here's a draft PR taking the caching opportunity in outlines and reapplying it to vllm on top of latest. #6715

Will review.

@lynkz-matt-psaltis

New draft PR based on our discussion @lapp0 to remove the duplicate logits processors in vLLM:
#6900

I'm seeing an upstream error with CachedLlamaTokenizerFast here:
https://buildkite.com/vllm/fastcheck/builds/1264#0190ff17-7643-4664-8108-9d2abc4bf589/192-1046

Will investigate if that's not a known issue.

Cheers!

@lapp0

lapp0 commented Jul 29, 2024

@lynkz-matt-psaltis thanks for the info!

We need a normalized tokenizer. Could you ensure adapt_tokenizer is called like it is here?

Also be aware that adapt_tokenizer will be imported via from outlines.models.vllm import adapt_tokenizer once this PR merges.

@Jason-CKY

Hi, I would like to ask whether caching the Outlines allowed tokens might cause the GPU to OOM under high load, since the cache is unbounded? Or is there another process within vLLM that manages VRAM usage?

@lapp0

lapp0 commented Aug 27, 2024

Hi, I would like to ask whether caching the Outlines allowed tokens might cause the GPU to OOM under high load, since the cache is unbounded? Or is there another process within vLLM that manages VRAM usage?

For each state in the automaton, Outlines stores a tensor with the list of legal token IDs. However, we don't store these tensors on the GPU, so it shouldn't result in a CUDA OOM.
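As a back-of-the-envelope check on host-memory usage (all numbers here are assumptions, chosen only to illustrate the order of magnitude):

# rough upper bound on the CPU-side cache: ~128k-token vocab, int64 ids,
# 500 FSM states, and every state allowing nearly the whole vocab (the worst case)
vocab_size = 128_256
bytes_per_id = 8
num_fsm_states = 500
worst_case_gb = vocab_size * bytes_per_id * num_fsm_states / 1e9
print(f"worst case ~{worst_case_gb:.2f} GB of host RAM")   # ~0.51 GB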
