
Mixtral AWQ fails to work: asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd214489990 #2621

Closed
pseudotensor opened this issue Jan 27, 2024 · 12 comments


@pseudotensor

export CUDA_HOME=/usr/local/cuda-12.3
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu123"
pip install git+https://github.com/vllm-project/vllm.git --upgrade
export CUDA_VISIBLE_DEVICES=1

python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ --quantization awq --dtype auto --seed 1234 --tensor-parallel-size=1 --max-num-batched-tokens=66560 --max-log-len=100
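
For reference, the kind of request that triggers this is just a streaming completion call against the OpenAI-compatible /v1/completions endpoint. A minimal sketch (the prompt and sampling parameters below are placeholders, not the exact values from my client):

import requests

# Placeholder prompt/parameters; any simple streaming completion request will do.
url = "http://localhost:5002/v1/completions"
payload = {
    "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    "prompt": "Who are you?",
    "max_tokens": 256,
    "temperature": 0.0,
    "stream": True,  # streamed responses go through Starlette's disconnect listener
}
with requests.post(url, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))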

Any query, even a simple one, leads to:

INFO 01-27 01:15:31 api_server.py:209] args: Namespace(host='0.0.0.0', port=5002, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mixtral-8x7B-Instru>
WARNING 01-27 01:15:31 config.py:176] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-27 01:15:31 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=>
INFO 01-27 01:15:33 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 01-27 01:17:50 llm_engine.py:316] # GPU blocks: 12486, # CPU blocks: 2048
INFO 01-27 01:17:51 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-27 01:17:51 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 01-27 01:18:04 model_runner.py:689] Graph capturing finished in 13 secs.
INFO 01-27 01:18:04 serving_chat.py:260] Using default chat template:
INFO 01-27 01:18:04 serving_chat.py:260] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'ass>
INFO:     Started server process [276444]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5002 (Press CTRL+C to quit)
INFO 01-27 01:18:14 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:24 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:34 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:41 async_llm_engine.py:433] Received request cmpl-cd9d75c607614e7db704b01164bc0c83-0: prompt: None, prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.14000000000000012, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early>
INFO:     52.0.25.199:43684 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-27 01:18:41 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 257, in wrap
    await func()
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 234, in listen_for_disconnect
    message = await receive()
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
    await self.message_event.wait()
  File "/ephemeral/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd214489990

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
    await self.asgi_callable(scope, receive, wrapped_send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
    await route.handle(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 254, in __call__
    async with anyio.create_task_group() as task_group:
  File "/ephemeral/vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 01-27 01:18:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:51 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 01-27 01:18:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:01 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:11 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:21 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:26 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:31 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:33 async_llm_engine.py:112] Finished request cmpl-cd9d75c607614e7db704b01164bc0c83-0.
INFO 01-27 01:19:44 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:19:54 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:20:04 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:20:14 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:20:24 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

@pseudotensor
Author

When I try to use a different model, I get other errors:

export CUDA_HOME=/usr/local/cuda-12.3
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu123"
pip install git+https://github.com/vllm-project/vllm.git --upgrade
export CUDA_VISIBLE_DEVICES=1

python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model casperhansen/mixtral-instruct-awq --quantization awq --dtype auto --seed 1234 --tensor-parallel-size=1 --max-num-batched-tokens=66560 --max-log-len=100
INFO:     52.0.25.199:53634 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
    await self.asgi_callable(scope, receive, wrapped_send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
    await route.handle(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/ephemeral/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in create_completion
    return JSONResponse(content=generator.model_dump(),
AttributeError: 'ErrorResponse' object has no attribute 'model_dump'

@pseudotensor
Author

Same exception for https://huggingface.co/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ

INFO 01-27 01:46:22 llm_engine.py:316] # GPU blocks: 18691, # CPU blocks: 2048
INFO 01-27 01:46:23 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-27 01:46:23 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decreas>
INFO 01-27 01:46:36 model_runner.py:689] Graph capturing finished in 13 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-27 01:46:36 serving_chat.py:260] Using default chat template:
INFO 01-27 01:46:36 serving_chat.py:260] {% for message in messages %}{{'<|im_start|>' + message['role'] + '
INFO 01-27 01:46:36 serving_chat.py:260] ' + message['content'] + '<|im_end|>' + '
INFO 01-27 01:46:36 serving_chat.py:260] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 01-27 01:46:36 serving_chat.py:260] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [277174]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5002 (Press CTRL+C to quit)
INFO 01-27 01:46:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:46:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:47:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:47:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:47:26 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:47:36 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:47:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:47:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:48:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:48:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:48:26 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:48:36 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:48:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:48:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:49:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:49:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:49:26 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:49:36 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:49:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:49:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:50:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:50:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:50:26 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:50:36 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:50:46 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:50:56 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:51:06 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:51:16 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-27 01:51:22 async_llm_engine.py:433] Received request cmpl-f9ebefcdb0b940ff9edb641a0cc03634-0: prompt: None, prefix_pos: None,sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.14000000000000012, frequency_penalty=0.0, repetition_penalt>
INFO:     24.4.148.180:50012 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-27 01:51:22 llm_engine.py:871] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 257, in wrap
    await func()
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/responses.py", line 234, in listen_for_disconnect
    message = await receive()
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 580, in receive
    await self.message_event.wait()
  File "/ephemeral/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f318c3b1900

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

@pseudotensor
Author

On the 0.2.7 release, I never get any answer back, even though it seems to generate.

@pseudotensor
Author

pseudotensor commented Jan 27, 2024

TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ on 0.2.7 just hangs in generation and never shows any output.

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ/discussions/3

@hediyuan

When I try to use a different model, I get other errors:

export CUDA_HOME=/usr/local/cuda-12.3
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu123"
pip install git+https://github.com/vllm-project/vllm.git --upgrade
export CUDA_VISIBLE_DEVICES=1

python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model casperhansen/mixtral-instruct-awq --quantization awq --dtype auto --seed 1234 --tensor-parallel-size=1 --max-num-batched-tokens=66560 --max-log-len=100
INFO:     52.0.25.199:53634 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/ephemeral/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/aioprometheus/asgi/middleware.py", line 184, in __call__
    await self.asgi_callable(scope, receive, wrapped_send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
    await route.handle(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/ephemeral/vllm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/ephemeral/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/ephemeral/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in create_completion
    return JSONResponse(content=generator.model_dump(),
AttributeError: 'ErrorResponse' object has no attribute 'model_dump'

I encountered the same problem using the Qwen/Qwen-7B-Chat model. It seems that vllm.entrypoints.openai.api_server is not compatible.
When the service first starts, there is also an error: AttributeError: 'AsyncLLMEngine' object has no attribute 'do_log_stats'. I am already using the latest 0.2.7 release.

@ronensc
Contributor

ronensc commented Jan 30, 2024

The error

AttributeError: 'ErrorResponse' object has no attribute 'model_dump'

seems related to the version of pydantic, which was recently upgraded in #2531
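
A minimal way to see the incompatibility outside of vLLM (the ErrorResponse class below is a made-up stand-in, not vLLM's): pydantic v1 models only expose .dict(), while v2 renamed the serializer to .model_dump(), so an environment that still has pydantic 1.x installed hits exactly this AttributeError.

import pydantic
from pydantic import BaseModel

# Hypothetical stand-in for vLLM's ErrorResponse; only the
# .dict() vs .model_dump() behaviour matters here.
class ErrorResponse(BaseModel):
    message: str
    code: int = 500

resp = ErrorResponse(message="model not found")

if pydantic.VERSION.startswith("1."):
    # Pydantic v1 has no .model_dump(); calling it raises the
    # AttributeError seen in create_completion above.
    print(resp.dict())
else:
    # Pydantic v2 renamed the serializer to .model_dump().
    print(resp.model_dump())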

@hmellor
Collaborator

hmellor commented Apr 4, 2024

@pseudotensor are you still experiencing this issue?

@pseudotensor
Author

Not lately. Can close.

@91he

91he commented May 10, 2024

@pseudotensor When using a Qwen1.5 AWQ model, I still have this problem.

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f8387440e50

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

@simon376

This doesn't seem to be model- or quantization-specific; I got the same error just now using TechxGenus/starcoder2-7b-GPTQ.
I'm using the latest vLLM Docker image with the following arguments:
--model "TechxGenus/starcoder2-7b-GPTQ" --revision 48dc06e6a6df8a8e8567694ead23f59204fa0d26 -q marlin --enable-chunked-prefill --max-model-len 4096 --gpu-memory-utilization 0.3

It still worked last week, so I'm not sure what the reason is.

tgi-scripts-vllm-starcoder-2-7b-gptq-1  | INFO 05-28 09:47:25 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='TechxGenus/starcoder2-7b-GPTQ', speculative_config=None, tokenizer='TechxGenus/starcoder2-7b-GPTQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=48dc06e6a6df8a8e8567694ead23f59204fa0d26, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TechxGenus/starcoder2-7b-GPTQ)

@josephrocca

josephrocca commented Jun 2, 2024

FWIW, I hit this CancelledError: Cancelled by cancel scope error when sending several concurrent requests, and I fixed it by changing --gpu-memory-utilization 0.99 to --gpu-memory-utilization 0.96, so it might come down to some sort of silent allocation failure that a bit of headroom avoids. In my case I had two 4090s on RunPod using vllm/vllm-openai:v0.4.3, running a 70B GPTQ model. These were the other command-line options I had:

--quantization gptq --dtype float16 --enforce-eager --tensor-parallel-size 2 --max-model-len 4096 --kv-cache-dtype fp8
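
For concreteness, the adjusted launch looked roughly like this; a sketch only, assuming the image's default api_server entrypoint, with a placeholder model id and port mapping:

# Placeholder model id and port mapping; the flags are the ones listed above,
# with --gpu-memory-utilization lowered from 0.99 to 0.96.
docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.4.3 \
    --model TheBloke/Some-70B-GPTQ \
    --quantization gptq --dtype float16 --enforce-eager \
    --tensor-parallel-size 2 --max-model-len 4096 --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.96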

@hmellor @pseudotensor Maybe this could be re-opened for the benefit of others, since two people have commented with similar issues since it was closed.

Potentially related:

@hmellor
Collaborator

hmellor commented Jun 2, 2024

The same generic error in starlette does not mean the problem is the same, especially when the original error was reported 4 months ago using a (now) old version of vLLM.

The different traces shared in the thread comments are not that similar.

If someone is experiencing this error, they should open a new issue with instructions on how to reproduce it.
