
[Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess. #6152

Closed
excelsimon opened this issue Jul 5, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@excelsimon

Your current environment

vllm version: '0.5.0.post1'

🐛 Describe the bug

When I set tensor_parallel_size=1, it works well.
But if I set tensor_parallel_size>1, the following error occurs:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.
After I add

import torch
import multiprocessing
torch.multiprocessing.set_start_method('spawn')

the same RuntimeError still occurs.

@excelsimon excelsimon added the bug Something isn't working label Jul 5, 2024
@youkaichao
Member

Please paste your full code. You might have initialized CUDA before using vLLM.
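For reference, a minimal sketch (hypothetical, not code from this issue) of how that usually happens: any CUDA call in the parent process before constructing the LLM initializes a CUDA context that the fork-started workers cannot re-initialize.

import torch
from vllm import LLM

# Hypothetical reproduction: CUDA is touched in the parent process
# before the LLM is created.
torch.zeros(1, device="cuda")  # CUDA context is now initialized here

# With the default fork-based worker start method, constructing a
# tensor-parallel LLM now fails with
# "RuntimeError: Cannot re-initialize CUDA in forked subprocess."
llm = LLM(model="google/gemma-2-27b-it", tensor_parallel_size=2)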

@yuchenlin

I'm also having the same issue with the latest version of vLLM + gemma-2-27b-it.

@yuchenlin

export VLLM_WORKER_MULTIPROC_METHOD=spawn may help
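If you prefer setting it from Python instead of the shell, a minimal sketch (my assumption is that it only needs to be set before the LLM is created):

import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"  # set before creating the LLM

from vllm import LLM
llm = LLM(model="google/gemma-2-27b-it", tensor_parallel_size=2)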

@youkaichao
Member

I can run the following code without any issues:

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="google/gemma-2-27b-it", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

@excelsimon
Author

export VLLM_WORKER_MULTIPROC_METHOD=spawn may help

It works for me. Thank you~

@yuchenlin

yuchenlin commented Jul 8, 2024

I can run the following code without any issues:

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="google/gemma-2-27b-it", tensor_parallel_size=2)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

I'm not sure; maybe it's a version issue with powerinfer? I also find that even when I use the above workaround to get vLLM to generate, the output quality is not as good as other Gemma-2-27B inference setups (both under greedy decoding).

@henry-y

henry-y commented Jul 24, 2024

export VLLM_WORKER_MULTIPROC_METHOD=spawn may help

It also works for me! Thank you!

@CharlesRiggins
Contributor

I have encountered the same issue and solved it by setting VLLM_WORKER_MULTIPROC_METHOD=spawn, as mentioned by @yuchenlin.
Now I'm wondering why this error occurs and why setting VLLM_WORKER_MULTIPROC_METHOD fixes it. Could someone clarify?
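As far as I understand (a rough explanation, not a description of vLLM internals): with the default fork start method, each tensor-parallel worker is forked from a parent process whose CUDA context is already initialized, and CUDA cannot be re-initialized in a forked child; spawn starts fresh interpreters, so each worker initializes CUDA from scratch. The standalone snippet below (hypothetical, no vLLM involved) reproduces the same behavior with plain multiprocessing:

import multiprocessing as mp
import torch

def use_cuda():
    # Any CUDA call in the child tries to (re-)initialize CUDA there.
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")  # parent initializes CUDA first

    # fork: the child inherits the parent's CUDA state and fails with
    # "Cannot re-initialize CUDA in forked subprocess."
    p = mp.get_context("fork").Process(target=use_cuda)
    p.start()
    p.join()

    # spawn: the child is a fresh interpreter, so CUDA initializes cleanly.
    p = mp.get_context("spawn").Process(target=use_cuda)
    p.start()
    p.join()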

@lonngxiang

It doesn't work for me:

export VLLM_WORKER_MULTIPROC_METHOD=spawn;CUDA_VISIBLE_DEVICES=1 vllm serve  /ai/minicpmv --host 192.168.2.238 --port 10868 --max-model-len 10000 --trust-remote-code --api-key token-abc123 --gpu_memory_utilization 1 --trust-remote-code 


@rin2401

rin2401 commented Sep 19, 2024

from peft import PeftModel, PeftConfig
Importing peft before using vLLM caused this error for me too.

@youkaichao
Member

@rin2401 try using distributed_executor_backend="ray"?
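For reference, a minimal sketch of what I believe is being suggested (parameter name as in recent vLLM versions; the model name is reused from earlier in this thread as a placeholder):

from vllm import LLM

# Use the Ray executor for the tensor-parallel workers instead of the
# default multiprocessing (fork) executor.
llm = LLM(
    model="google/gemma-2-27b-it",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)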

@IIDCII

IIDCII commented Oct 22, 2024

export VLLM_WORKER_MULTIPROC_METHOD=spawn may help

If you're using this in a Python notebook, run the following first on a restarted kernel:

import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
