Description
My issue is similar to triton-inference-server/tensorrtllm_backend#481 except it's for vllm.
I have the following config.pbtxt
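A minimal sketch of the relevant part of such a config (illustrative only, not the exact file; the model name is taken from the log below and the instance_group pins the instance to GPU 1):

# Illustrative config.pbtxt sketch: one KIND_GPU instance pinned to GPU 1.
name: "vllm_model_1"
backend: "vllm"
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]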
When I check nvidia-smi, I get:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   43C    P0             74W / 300W  | 28465MiB / 32768MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0             61W / 300W  |   681MiB / 32768MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Note that I requested the model to run on GPU 1, but nvidia-smi shows that it is running on GPU 0. This is a problem because I have 2 GPUs and want to run one model on each, but vllm/Triton tries to place both on the same GPU (#0), which causes a CUDA OOM error; this seems related to #6855.
I saw that the latest commit on main made changes related to CUDA devices, so I patched my Triton server with that code. The logs show that it is using device 1, but for some reason this does not seem to take effect:
I0613 06:03:52.276264 1 model.py:166] "Detected KIND_GPU model instance, explicitly setting GPU device=1 for vllm_model_1"
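For reference, the mechanism that log line describes can be sketched roughly as follows. This is an illustrative Triton Python-backend snippet, not the actual vllm backend source; it assumes torch is importable in the backend environment and only shows how the instance's device id from config.pbtxt would be applied:

# Illustrative sketch, not the vllm backend's model.py.
import torch

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the instance kind and device id configured in config.pbtxt.
        if args["model_instance_kind"] == "GPU":
            device_id = int(args["model_instance_device_id"])
            # Make this GPU the default for CUDA allocations in this process;
            # whether the engine actually honors it is what this issue is about.
            torch.cuda.set_device(device_id)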
Triton Information
24.05
Are you using the Triton container or did you build it yourself?
Container from NGC.
Note that I patched the backend code.
To Reproduce
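One way to start the server (illustrative; the image tag, paths, and the mount of the patched model.py are assumptions, not the exact command I used):

docker run --rm --gpus all \
  -v $(pwd)/model_repository:/models \
  -v $(pwd)/model.py:/opt/tritonserver/backends/vllm/model.py \
  nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 \
  tritonserver --model-repository=/models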
Then check device placement on the host with:
nvidia-smi
Expected behavior
The model should be loaded on GPU 1.