vLLM ignores my requests when I increase the number of concurrent requests #2752
Comments
There could be an issue with your processing script; I've managed to have hundreds of concurrent requests in flight without seeing issues like this. Would you be able to share your processing script?
@savannahfung @WoosukKwon @hmellor I'm experiencing a similar issue to the one described above, including with `vllm.entrypoints.api_server`. Initially it handles concurrent requests effectively, but eventually it starts to hang. My assumption is that this might be related to the GPU KV cache steadily increasing to 99.4%, leading to crashes and hang-ups and leaving other requests in a "pending" state. It's highly probable that this issue is related to #2731.
@hmellor I am testing the performance of different LLMs and the script works for smaller numbers of concurrent requests. For openchat/openchat_3.5 it hangs at around 10 concurrent requests, but if I use larger models (e.g. Nexusflow/NexusRaven-13B, mistralai/Mixtral-8x7B-Instruct-v0.1) it hangs earlier, at around 2-3 concurrent requests.
@nehalvaghasiya That's weird, because just before it hangs the GPU KV cache usage is only 4.2% and the CPU KV cache usage is 0.0% — unless it suddenly spikes to 99.9%.
@savannahfung I meant the part of your script that makes the requests; you've provided too much extra code to find where the problem is. Can you make a minimal reproducer? I'm following up because I am also now seeing this issue, where 100 concurrent requests are indefinitely swapped in and out of the pending queue, as described by @nehalvaghasiya.
Hi all,

This is how I start the OpenAI server:

This is my Python code to reproduce the error:

```python
import asyncio
from openai import AsyncOpenAI

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1/")

async def _send_chat_completion(messages):
    completion = await client.chat.completions.create(model=model_name, messages=messages, temperature=0.0)
    return completion.choices[0].message.content.strip()

async def _send_async_requests(prompts_messages):
    tasks = [_send_chat_completion(msgs) for msgs in prompts_messages]
    responses = await asyncio.gather(*tasks)
    return responses

prompts_msgs = [{'role': 'user', 'content': 'suggest a dinner meal'}]

print('Starting first run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
print('Starting second run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
```

The second run never finishes, and the server logs don't even mention its requests coming in. I want to point other users facing a similar issue to the corresponding issue on the openai GitHub page, where they report that they are actively working on a fix, but it seems to be a more serious problem related to other modules used by openai (see openai/openai-python#769). My workaround was to use raw requests, where I did not see this error happening (although openai reports in the linked issue that you might encounter the same problem there, too). Adjusting the above code looks like this:

```python
import asyncio
import aiohttp

async def _send_chat_completion(messages):
    print('starting openai request')
    async with aiohttp.ClientSession() as session:
        response = await session.post(
            url="http://localhost:8001/v1/chat/completions",
            json={"messages": messages, "model": "mistralai/Mistral-7B-Instruct-v0.2"},
            headers={"Content-Type": "application/json"},
        )
        return await response.json()

async def _send_async_requests(prompts_messages):
    tasks = [_send_chat_completion(msgs) for msgs in prompts_messages]
    responses = await asyncio.gather(*tasks)
    responses = [resp['choices'][0]['message']['content'].strip() for resp in responses]
    return responses

prompts_msgs = [{'role': 'user', 'content': 'suggest a dinner meal'}]

print('Starting first run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
print('Starting second run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
```

Let's hope it gets fixed quickly.
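For readers who would rather keep the openai client than switch to raw aiohttp requests, a minimal, untested sketch is shown below that constructs a fresh `AsyncOpenAI` client inside each `asyncio.run` call, so its connection pool is tied to the event loop created for that run. The helper name `_run_batch` is illustrative, and whether this actually avoids openai/openai-python#769 depends on the library version.

```python
# Untested variant of the script above (an assumption, not from the original comment):
# create the AsyncOpenAI client inside the coroutine so each asyncio.run (new event loop)
# gets its own client and connection pool.
import asyncio
from openai import AsyncOpenAI

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

async def _run_batch(prompts_messages):
    # Fresh client per event loop instead of a module-level global.
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1/")

    async def _one(messages):
        completion = await client.chat.completions.create(
            model=MODEL, messages=messages, temperature=0.0
        )
        return completion.choices[0].message.content.strip()

    return await asyncio.gather(*(_one(msgs) for msgs in prompts_messages))

prompts_msgs = [{'role': 'user', 'content': 'suggest a dinner meal'}]
print('Starting first run..')
print(len(asyncio.run(_run_batch([prompts_msgs] * 5))), 'responses')
print('Starting second run..')
print(len(asyncio.run(_run_batch([prompts_msgs] * 5))), 'responses')
```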
I am using a runpod container to run vLLM.
Template: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
GPU Cloud: 1 x RTX 3090 | 12 vCPU 31 GB RAM
It works perfectly fine when I send 9 concurrent requests, but it starts to hang when I increase it to 10.

```
python -m vllm.entrypoints.openai.api_server --model openchat/openchat_3.5 --tensor-parallel-size 1
```

It just stops processing the last input and hangs there. I tried to include `--swap-space 0`, but the error still exists; nothing changes.
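A minimal probe script for anyone trying to reproduce this threshold behaviour is sketched below. It is not the reporter's actual code: it assumes the api_server's default port 8000, the openchat/openchat_3.5 model from the launch command above, and simply sends 9 and then 10 concurrent requests to see where the server stops responding.

```python
# Hypothetical probe (an illustration, not the reporter's script): send N concurrent
# chat-completion requests against the OpenAI-compatible server started above.
# Port 8000 is the api_server default; adjust URL/MODEL to match your deployment.
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "openchat/openchat_3.5"

async def send_one(session, i):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"suggest a dinner meal ({i})"}],
        "temperature": 0.0,
    }
    async with session.post(URL, json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["message"]["content"]

async def run_batch(n):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(send_one(session, i) for i in range(n)))

for n in (9, 10):  # reportedly fine at 9 concurrent requests, hangs at 10
    print(f"sending {n} concurrent requests...")
    print(len(asyncio.run(run_batch(n))), "responses received")
```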