
[BugFix] Avoid initializing CUDA too early #3487

Merged
1 commit merged into vllm-project:main from the fix-cuda-init branch on Mar 19, 2024

Conversation

@njhill (Collaborator) commented Mar 19, 2024

Care is taken in the code to avoid initializing CUDA prior to CUDA_VISIBLE_DEVICES being set in the worker, but an instance of this was inadvertently introduced in #2569.
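(For context, a minimal sketch of the pattern being protected here, assuming each worker pins its own CUDA_VISIBLE_DEVICES before any torch.cuda call; the function name and structure below are illustrative, not vLLM's actual worker code.)

import os

def init_worker(local_rank: int) -> None:
    # Pin the visible GPU *before* anything initializes CUDA in this process.
    # CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes, so any
    # earlier torch.cuda call would lock in the wrong device set.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

    import torch
    torch.cuda.set_device(0)  # device 0 of the now-restricted visible set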
@Yard1 (Collaborator) commented Mar 19, 2024

Actually, is it possible to somehow validate DeviceConfig inside the worker, after we have set CUDA_VISIBLE_DEVICES?

@youkaichao (Member)

Can we use something like what's in setup.py?

import torch

def _is_cuda() -> bool:
    return torch.version.cuda is not None  # build-time constant; no CUDA context created

This should not initialize the CUDA context either.

It is not safe to assume CUDA just because the platform is not Neuron.
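(To make the distinction in this exchange concrete, a small hedged illustration: torch.version.cuda is a constant baked into the PyTorch build, while runtime probes may touch the CUDA driver in the calling process; how much state those probes initialize varies across PyTorch versions.)

import torch

# Build-time check: True for any CUDA wheel of PyTorch, even on a machine
# with no GPU present. Reading it does not create a CUDA context.
built_with_cuda = torch.version.cuda is not None

# Runtime probes such as these may initialize CUDA state in this process
# (version-dependent), which is exactly what must be deferred until after
# CUDA_VISIBLE_DEVICES has been set in the worker:
#   torch.cuda.is_available()
#   torch.cuda.device_count()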

@njhill (Collaborator, Author) commented Mar 19, 2024

Can we use something like what's in setup.py?

@youkaichao I assume this just indicates whether a CUDA build of PyTorch is in use, and so would always return True.

It is not safe to assume CUDA just because the platform is not Neuron.

I'm not sure what "safe" means here. If CUDA/GPU isn't found then the server will fail to start either way; it's just that if it could be checked here, it would fail slightly earlier with a nicer message.

Actually, is it possible to somehow validate DeviceConfig inside the worker, after we have set CUDA_VISIBLE_DEVICES?

@Yard1 I'm not sure of an easy way to do that without nontrivial restructuring. Here, all I'm doing is reverting something introduced when the Neuron changes were added. We could consider that further as a separate improvement?
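(As a hedged sketch of what Yard1's suggestion might look like: the device check could run inside the worker process once CUDA_VISIBLE_DEVICES has been set there. The function below is hypothetical and not part of this PR or the vLLM codebase.)

import os
import torch

def validate_device_in_worker(device_type: str) -> None:
    # Runs inside the worker, after CUDA_VISIBLE_DEVICES has been set for it,
    # so touching torch.cuda here cannot poison the parent process.
    assert "CUDA_VISIBLE_DEVICES" in os.environ
    if device_type == "cuda" and not torch.cuda.is_available():
        raise RuntimeError(
            "DeviceConfig requested 'cuda' but no GPU is visible to this worker")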

@Yard1 (Collaborator) commented Mar 19, 2024

OK, sounds good, no blockers from my side.

@youkaichao (Member)

@youkaichao I assume this just indicates whether a CUDA build of PyTorch is in use, and so would always return True.

I'm not familiar with Neuron. When we use Neuron, is torch.version.cuda also set?

@rkooo567 (Collaborator)

Also @njhill, do you happen to know how this DeviceConfig was getting initialized before forking happens in the CI? I wonder if there's a way to restructure the CI to avoid the same problem.

@zhuohan123 (Collaborator) left a comment

LGTM! I think this is a good temporary fix. This should be changed if we want to support CPU in the future.

@zhuohan123 (Collaborator)

Let me know when this is ready to be merged!

@njhill (Collaborator, Author) commented Mar 19, 2024

Also @njhill, do you happen to know how this DeviceConfig was getting initialized before forking happens in the CI? I wonder if there's a way to restructure the CI to avoid the same problem.

@rkooo567 I was just speculating that this bug might be the cause of that, but if you're referring to the 19 failing tests in Model Tests then it doesn't look like that's the case, since they still fail in the CI for this branch.

@zhuohan123 from my pov it's ready to be merged, thanks!

@zhuohan123 zhuohan123 merged commit 7341c77 into vllm-project:main Mar 19, 2024
31 checks passed
@njhill njhill deleted the fix-cuda-init branch March 19, 2024 19:13