
Error during inference with Mixtral 7bx8 GPTQ #2271

Closed
mlinmg opened this issue Dec 26, 2023 · 8 comments
Labels
duplicate This issue or pull request already exists

Comments


mlinmg commented Dec 26, 2023

Traceback (most recent call last):
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 338, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 199, in step_async
    return self._process_model_outputs(output, scheduler_outputs) + ignored
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 562, in _process_model_outputs
    self._process_sequence_group_outputs(seq_group, outputs)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/llm_engine.py", line 554, in _process_sequence_group_outputs
    self.scheduler.free_seq(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/scheduler.py", line 312, in free_seq
    self.block_manager.free(seq)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 277, in free
    self._free_block_table(block_table)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 268, in _free_block_table
    self.gpu_allocator.free(block)
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/core/block_manager.py", line 48, in free
    raise ValueError(f"Double free! {block} is already freed.")
ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=2611, ref_count=0) is already freed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 63, in add_cors_header
    response = await call_next(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
    raise app_exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/home/marco/miniconda3/envs/serving/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/marco/Scrivania/TESI/serving/vllm_server.py", line 137, in generate
    async for request_output in results_generator:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 445, in generate
    raise e
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 439, in generate
    async for request_output in stream:
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/home/marco/Scrivania/TESI/serving/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
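
For context on what the error means: the traceback shows that vLLM's block manager tracks a reference count per physical GPU block of the KV cache, and freeing a block whose ref_count is already zero is flagged as a double free (PhysicalTokenBlock(..., ref_count=0) above). Below is a minimal, hypothetical sketch of that guard pattern; the class and method names are illustrative stand-ins, not vLLM's actual implementation.

from dataclasses import dataclass

@dataclass
class PhysicalBlock:
    # Illustrative stand-in for vLLM's PhysicalTokenBlock.
    block_number: int
    ref_count: int = 0

class GPUBlockAllocator:
    # Hypothetical ref-counted allocator showing the double-free guard.
    def __init__(self, num_blocks: int) -> None:
        # All blocks start out free (ref_count == 0).
        self.free_blocks = [PhysicalBlock(i) for i in range(num_blocks)]

    def allocate(self) -> PhysicalBlock:
        block = self.free_blocks.pop()
        block.ref_count = 1
        return block

    def free(self, block: PhysicalBlock) -> None:
        if block.ref_count == 0:
            # A second free of the same block means it was already
            # returned to the allocator; this is a bookkeeping bug upstream.
            raise ValueError(f"Double free! {block} is already freed.")
        block.ref_count -= 1
        if block.ref_count == 0:
            self.free_blocks.append(block)

In the report above the guard fires while Scheduler.free_seq releases a sequence's block table, meaning those blocks had already been returned to the allocator earlier in the same step.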

oushu1zhangxiangxuan1 (Contributor) commented Dec 27, 2023

Got the same error with the original model.

@adamlin120

Got the same error with a fine-tuned Mixtral 7bx8.


iibw commented Jan 3, 2024

I tried to load a GPTQ version of Mixtral 8x7B and got an error, but a different one from the one posted here.

I got:

config.py gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
config.py gptq does not support CUDA graph yet. Disabling CUDA graph.
worker.py -- Started a local Ray instance.
llm_engine.py Initializing an LLM engine with config: model='model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=gptq, enforce_eager=True, seed=0)
Traceback (most recent call last):
  File "local_path/mixtral_vllm.py", line 3, in <module>
    llm = LLM("model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ", quantization="GPTQ", tensor_parallel_size=4)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 250, in from_engine_args
    engine = cls(*engine_configs,
             ^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
    self._run_workers(
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
                  ^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=X, ip=X.X.X.X, actor_id=X, repr=<vllm.engine.ray_utils.RayWorkerVllm object at X>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "anaconda_env_path/lib/python3.11/site-packages/vllm/model_executor/model_loader.py", line 55, in get_model
    raise ValueError(
ValueError: torch.bfloat16 is not supported for quantization method gptq. Supported dtypes: [torch.float16]

I tried changing the dtype in the config.json to torch.float16 to try to fix it, but instead got the same error as in #2251. Maybe these two errors are actually the same and related to vLLM not supporting torch.bfloat16? @casper-hansen

@casper-hansen (Contributor)

You need to use float16 or half for quantization.
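
In practice that means requesting half precision when constructing the engine rather than editing the checkpoint's config.json, since the error above states that GPTQ only supports torch.float16. A minimal sketch, reusing the placeholder model path and tensor_parallel_size=4 from the traceback above; the prompt and max_tokens are arbitrary, so adjust everything for your setup.

from vllm import LLM, SamplingParams

# GPTQ kernels in vLLM only run in float16, so override the checkpoint's
# bfloat16 default explicitly when creating the engine.
llm = LLM(
    model="model_path/TheBloke_Mixtral-8x7B-Instruct-v0.1-GPTQ",  # placeholder path from the report above
    quantization="gptq",
    dtype="half",  # equivalently "float16"
    tensor_parallel_size=4,
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)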


iibw commented Jan 4, 2024

@casper-hansen

You need to use float16 or half for quantization.

I switched it to torch.float16 in the config.json, and my error changed to the one in #2251.

@casper-hansen (Contributor)

Did you try upgrading to the latest vLLM?


iibw commented Jan 4, 2024

I'll try doing that now


iibw commented Jan 4, 2024

Yep! It seems like the latest vLLM has fixed this bug. Both GPTQ and AWQ are working for me now. Thanks for the help :)

@hmellor added the duplicate label on Mar 9, 2024
@hmellor closed this as not planned on Mar 9, 2024