Launching a multi-GPU Triton server fails with an error #524

@dwq370

System Info

4x NVIDIA L20

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

step1: create a container
docker run -it --gpus '"device=0,1,2,3"' --ipc=host --network=host --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /home/qxzg/workspace/tensorrt-llm-triton/workload:/workspace -v /home/qxzg/workspace/tensorrt-llm-triton/data:/models \
  -v /home/qxzg/.cache/modelscope/hub/qwen/Qwen2-72B-Instruct:/models/hf_models/qwen2-instruct-72b \
  -p 19000:9000 \
  triton_tensorrtllm_backend:v0.10.0 bash

step2: set environment variable
export HF_MODEL_DIR=/models/hf_models/qwen2-instruct-72b
export TMP_CHECKPOINT_DIR=/models/trt_output/checkpoints/Qwen2-72B/
export TRT_ENGINE_DIR=/models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4
export MAX_BATCH_SIZE=1
mkdir -p ${TRT_ENGINE_DIR} ${TMP_CHECKPOINT_DIR}

step3: convert model
python /workspace/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py \
  --model_dir ${HF_MODEL_DIR} \
  --dtype float16 \
  --output_dir ${TMP_CHECKPOINT_DIR} \
  --tp_size 4

step4: build engine
trtllm-build --checkpoint_dir $TMP_CHECKPOINT_DIR \
  --output_dir $TRT_ENGINE_DIR \
  --gemm_plugin float16 \
  --strongly_typed \
  --paged_kv_cache enable \
  --remove_input_padding enable \
  --max_input_len=16384 \
  --max_output_len=512 \
  --max_batch_size 1 \
  --tp_size=4

The build produces the following engine directory:
root@ubuntu-H3C-UniServer-R4900-G5:/app# ll /models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4/
total 149414208
drwxr-xr-x 2 root root 4096 Jul 5 05:47 ./
drwxr-xr-x 3 root root 4096 Jul 5 03:02 ../
-rw-r--r-- 1 root root 5637 Jul 5 04:59 config.json
-rw-r--r-- 1 root root 38250028580 Jul 5 05:00 rank0.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:03 rank1.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:06 rank2.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:09 rank3.engine
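
As a quick sanity check, the file sizes in this listing can be compared with the "Loaded engine size: 36478 MiB" lines in the server log further down. The snippet below is only a sketch of that arithmetic; the constant names are made up for illustration, and the byte count is copied from the listing above.

```python
# Size of each rankN.engine file, copied from the directory listing above.
ENGINE_BYTES = 38_250_028_580
NUM_RANKS = 4          # tp_size used for the build
MIB = 1024 ** 2
GIB = 1024 ** 3

# Whole MiB per engine file; the server log reports the same figure per rank.
per_rank_mib = ENGINE_BYTES // MIB

# Combined size of all four engine files.
total_bytes = ENGINE_BYTES * NUM_RANKS

if __name__ == "__main__":
    print(f"per-rank engine: {per_rank_mib} MiB")
    print(f"all ranks: {total_bytes} bytes ({total_bytes / GIB:.2f} GiB)")
```

The per-rank figure agrees with the log, so each rank appears to load its complete engine file.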

step5: create model repository
cp -rf /workspace/tensorrtllm_backend/all_models/inflight_batcher_llm/ /app/qwen_ifb
cp -rf /workspace/tensorrtllm_backend/tools/fill_template.py /app/

python3 /app/fill_template.py -i /app/qwen_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_DIR},tokenizer_type:auto,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 /app/fill_template.py -i /app/qwen_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_DIR},tokenizer_type:auto,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 /app/fill_template.py -i /app/qwen_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 /app/fill_template.py -i /app/qwen_ifb/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 /app/fill_template.py -i /app/qwen_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${TRT_ENGINE_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
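
The `max_tokens_in_paged_kv_cache:2560` setting above can be related to the "Allocating 209715200 bytes" and "Max KV cache pages per sequence: 40" lines in the log below. This is a rough sketch of that arithmetic, not the backend's actual code. The model shape (80 layers, 8 KV heads, head size 128) and the 64 tokens per KV block are assumptions taken from the published Qwen2-72B configuration and the usual TensorRT-LLM default; none of them is stated in this report. `kv_cache_bytes_per_rank` is a hypothetical helper name.

```python
def kv_cache_bytes_per_rank(max_tokens, num_layers, num_kv_heads,
                            head_size, tp_size, bytes_per_elem=2):
    """Estimate the paged KV cache size held by one tensor-parallel rank."""
    # KV heads are split evenly across tensor-parallel ranks.
    kv_heads_per_rank = num_kv_heads // tp_size
    # Factor 2 covers the separate key and value tensors of every layer.
    bytes_per_token = 2 * num_layers * kv_heads_per_rank * head_size * bytes_per_elem
    return max_tokens * bytes_per_token


# Assumed Qwen2-72B shape (not given in this report); fp16 means 2 bytes per element.
cache_bytes = kv_cache_bytes_per_rank(
    max_tokens=2560, num_layers=80, num_kv_heads=8, head_size=128, tp_size=4
)

# Assumed block size of 64 tokens per KV cache page.
pages_per_sequence = 2560 // 64

if __name__ == "__main__":
    print(cache_bytes, pages_per_sequence)
```

Under these assumptions both values match the log, which suggests the 2560-token limit, rather than `kv_cache_free_gpu_mem_fraction:0.5`, determines the cache size here.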

step6: launch triton server
python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=/app/qwen_ifb/ --http_port 18000 --grpc_port 18001 --metrics_port 18002 --log --log-file ./triton_log.txt

Expected behavior

The Triton server starts successfully.

Actual behavior

[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0705 08:07:40.119730 79387 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.119915 79385 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.120172 79387 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0705 08:07:40.120320 79385 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0705 08:07:40.120417 79387 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0705 08:07:40.120489 79385 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0705 08:07:40.120495 79387 server.cc:677]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1 | READY |
+--------------+---------+--------+

I0705 08:07:40.120545 79385 server.cc:677]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1 | READY |
+--------------+---------+--------+

I0705 08:07:40.121322 79386 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.121743 79386 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0705 08:07:40.121919 79386 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0705 08:07:40.122013 79386 server.cc:677]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1 | READY |
+--------------+---------+--------+

I0705 08:07:40.179571 79387 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.179605 79387 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.179612 79387 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.179618 79387 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.192136 79386 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.192164 79386 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.192172 79386 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.192178 79386 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.199299 79387 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.199513 79387 tritonserver.cc:2538]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /app/qwen_ifb/ |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | tensorrt_llm |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0705 08:07:40.199965 79385 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.200012 79385 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.200023 79385 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.200033 79385 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.201625 79387 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:18001
I0705 08:07:40.201876 79387 http_server.cc:4636] Started HTTPService at 0.0.0.0:18000
I0705 08:07:40.217713 79386 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.217959 79386 tritonserver.cc:2538]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /app/qwen_ifb/ |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | tensorrt_llm |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

E0705 08:07:40.219476055 79386 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:18001' {created_time:"2024-07-05T08:07:40.219384819+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-07-05T08:07:40.219362598+00:00", children:[UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.21932451+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.21928166+00:00"}]}, UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.219357544+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.219352483+00:00"}]}]}]}
E0705 08:07:40.219789 79386 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
I0705 08:07:40.229760 79385 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.230178 79385 tritonserver.cc:2538]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /app/qwen_ifb/ |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | tensorrt_llm |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

E0705 08:07:40.232606618 79385 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:18001' {created_time:"2024-07-05T08:07:40.232429669+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-07-05T08:07:40.232381555+00:00", children:[UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.232300483+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.232221534+00:00"}]}, UNKNOWN:Unable to configure socket {created_time:"2024-07-05T08:07:40.232372083+00:00", fd:180, children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.232360498+00:00"}]}]}]}
E0705 08:07:40.232992 79385 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use

I0705 08:07:40.243374 79387 http_server.cc:320] Started Metrics Service at 0.0.0.0:18002
Cleaning up...
Cleaning up...
Cleaning up...
error: creating server: Internal - failed to load all models
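
To make the log above easier to read, the sketch below groups the gRPC start and failure messages by listen address. It assumes the numeric field after the timestamp identifies the server process; `grpc_bind_summary` is a hypothetical helper, and the sample lines are copied from the log above. It only summarizes what the log says and does not establish why several processes tried to bind the same port.

```python
import re

# Log prefix: severity+date, time, numeric id, source location, then the message.
LOG_LINE = re.compile(
    r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+(?P<proc>\d+) \S+\] (?P<msg>.*)$"
)
STARTED = re.compile(r"Started GRPCInferenceService at (?P<addr>\S+)")
FAILED = re.compile(
    r"failed to start GRPC service: .*Socket '(?P<addr>[^']+)' already in use"
)


def grpc_bind_summary(log_text):
    """Map each gRPC listen address to the ids that started or failed on it."""
    summary = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # tables, TensorRT-LLM lines, blank lines
        proc = int(match.group("proc"))
        for key, pattern in (("started", STARTED), ("failed", FAILED)):
            hit = pattern.search(match.group("msg"))
            if hit:
                entry = summary.setdefault(
                    hit.group("addr"), {"started": [], "failed": []}
                )
                entry[key].append(proc)
    return summary


SAMPLE = """\
I0705 08:07:40.201625 79387 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:18001
E0705 08:07:40.219789 79386 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
E0705 08:07:40.232992 79385 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
"""

if __name__ == "__main__":
    print(grpc_bind_summary(SAMPLE))
```

For these lines the summary shows id 79387 starting the service on 0.0.0.0:18001 and ids 79386 and 79385 failing on the same address about 20 to 30 ms later.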

Additional notes

No process is using ports 18000, 18001, or 18002 in the container before the Triton server is launched.
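
One way to confirm that the ports are free from inside the container is to try binding each of them just before launching the server. This is a minimal sketch using only the Python standard library; `port_is_free` is a hypothetical helper, not part of the backend scripts.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically errno 98 (address already in use), as in the log above.
            return False
        return True


if __name__ == "__main__":
    # Ports passed to launch_triton_server.py in step 6.
    for port in (18000, 18001, 18002):
        print(port, "free" if port_is_free(port) else "in use")
```

A result of "free" for all three ports before launch would be consistent with the conflict arising only between the processes started by the launch itself.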
