
CUDA 12.1 vllm==0.2.3 Double Free #1930

Closed
tjtanaa opened this issue Dec 5, 2023 · 6 comments
Labels: duplicate (This issue or pull request already exists)

Comments

tjtanaa (Contributor) commented Dec 5, 2023

I tried this with FastChat, which uses vLLM as the backend. Both of the following requests:

1. openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        stream=False,
        # temperature=args.temperature,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        max_tokens=max_tokens,
        best_of=best_of,
        n=n,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        use_beam_search=True)
2. openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        stream=True,
        # temperature=args.temperature,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        max_tokens=max_tokens,
        best_of=best_of,
        n=n,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        use_beam_search=True)
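
For reference, here is a self-contained version of the same call against the vLLM/FastChat OpenAI-compatible server. This is only a sketch: the endpoint, model name, and concrete parameter values are placeholders I filled in, not taken from the report, and it assumes the legacy openai 0.28.x client, which forwards the vLLM-specific keyword arguments (best_of, top_k, use_beam_search) in the request body.

    # Hypothetical standalone reproduction sketch; endpoint, model name, and
    # parameter values are placeholders, not taken from the original report.
    import openai

    openai.api_key = "EMPTY"                      # the vLLM server ignores the key
    openai.api_base = "http://localhost:8000/v1"  # adjust to your server address

    response = openai.ChatCompletion.create(
        model="openhermes-2.5-mistral-7b",        # placeholder model name
        messages=[{"role": "user", "content": "Explain beam search in one paragraph."}],
        stream=False,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        max_tokens=512,
        best_of=4,             # beam width; must be > 1 when use_beam_search=True
        n=4,
        temperature=0.0,       # vLLM requires temperature=0 for beam search
        top_p=1.0,
        top_k=-1,
        use_beam_search=True,  # vLLM-specific sampling extension
    )
    print(response["choices"][0]["message"]["content"])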

raise the following error:

2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish  
2023-12-05 08:52:19 | ERROR | stderr |     task.result()                                                                                            
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop            
2023-12-05 08:52:19 | ERROR | stderr |     has_requests_in_progress = await self.engine_step()                                                      
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 338, in engine_step                
2023-12-05 08:52:19 | ERROR | stderr |     request_outputs = await self.engine.step_async()                                                         
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/engine/async_llm_engine.py", line 199, in step_async                 
2023-12-05 08:52:19 | ERROR | stderr |     return self._process_model_outputs(output, scheduler_outputs) + ignored                                  
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/engine/llm_engine.py", line 545, in _process_model_outputs           
2023-12-05 08:52:19 | ERROR | stderr |     self._process_sequence_group_outputs(seq_group, outputs)                                                 
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/engine/llm_engine.py", line 537, in _process_sequence_group_outputs  
2023-12-05 08:52:19 | ERROR | stderr |     self.scheduler.free_seq(seq)                                                                             
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/core/scheduler.py", line 310, in free_seq                            
2023-12-05 08:52:19 | ERROR | stderr |     self.block_manager.free(seq)                                                                             
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 277, in free                            
2023-12-05 08:52:19 | ERROR | stderr |     self._free_block_table(block_table)                                                                      
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 268, in _free_block_table               
2023-12-05 08:52:19 | ERROR | stderr |     self.gpu_allocator.free(block)                                                                           
2023-12-05 08:52:19 | ERROR | stderr |   File "/home/tan/tjtanaa/vllmcu12/vllm/core/block_manager.py", line 48, in free                             
2023-12-05 08:52:19 | ERROR | stderr |     raise ValueError(f"Double free! {block} is already freed.")                                              
2023-12-05 08:52:19 | ERROR | stderr | ValueError: Double free! PhysicalTokenBlock(device=Device.GPU, block_number=11634, ref_count=0) is already freed.                                                       
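
For context on the failure mode: the traceback ends in the block manager's ref-counted allocator. The sketch below is not the vLLM source, only an illustration (class and field names follow the error message) of how returning a finished sequence's block table twice trips the ref_count == 0 check.

    # Illustrative sketch only (not the actual vLLM code): a ref-counted block
    # allocator in the style implied by the traceback and error message.
    from dataclasses import dataclass

    @dataclass
    class PhysicalTokenBlock:
        device: str
        block_number: int
        ref_count: int = 0

    class BlockAllocator:
        def __init__(self, num_blocks: int) -> None:
            self.free_blocks = [PhysicalTokenBlock("GPU", i) for i in range(num_blocks)]

        def allocate(self) -> PhysicalTokenBlock:
            block = self.free_blocks.pop()
            block.ref_count = 1
            return block

        def free(self, block: PhysicalTokenBlock) -> None:
            if block.ref_count == 0:
                # The condition hit in the traceback: the same block is handed
                # back twice, e.g. when a finished sequence's block table is
                # freed a second time during beam-search bookkeeping.
                raise ValueError(f"Double free! {block} is already freed.")
            block.ref_count -= 1
            if block.ref_count == 0:
                self.free_blocks.append(block)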
WoosukKwon (Collaborator) commented:
Hi @tjtanaa, thanks for reporting the bug! Which model are you using? Is it Mistral?

tjtanaa (Contributor, Author) commented Dec 5, 2023

Yes, I am using OpenHermes-2.5, which is based on Mistral.

qati commented Dec 21, 2023

It is happening for me as well: CUDA 12.1, vllm 0.2.6 with Mixtral 8x7B, for long prompts.

qati commented Dec 27, 2023

@WoosukKwon any tips on this?

nxphi47 commented Jan 4, 2024

+1

jonaslsaa commented Jan 11, 2024

+1, same issue here using CUDA 12.1.1, Python 3.10.4 (GCCcore 11.3.0), vllm==0.2.3. It happened after 5-10 inferences with a LoRA fine-tuned Mistral 7B model.
vllm_double_free_bug.log

EDIT: In our case the fine-tuned model was trained with 1024 input tokens; when this limit was exceeded, it caused the double-free error.
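
A possible client-side mitigation, assuming the failure is indeed tied to exceeding the training-time input length (the model id below is a placeholder, and this guard is not part of vLLM or FastChat), is to truncate prompts before sending them:

    # Possible client-side mitigation sketch: truncate prompts to the input
    # length the model was fine-tuned with before calling the server.
    from transformers import AutoTokenizer

    MAX_INPUT_TOKENS = 1024  # training-time input length reported above
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder model id

    def clamp_prompt(prompt: str) -> str:
        token_ids = tokenizer.encode(prompt)
        if len(token_ids) <= MAX_INPUT_TOKENS:
            return prompt
        # Keep the first MAX_INPUT_TOKENS tokens and decode back to text.
        return tokenizer.decode(token_ids[:MAX_INPUT_TOKENS], skip_special_tokens=True)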

hmellor added the duplicate label on Mar 9, 2024
hmellor closed this as not planned on Mar 9, 2024