Description
System Info
4*NVIDIA L20
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
step1: create a container
docker run -it --gpus '"device=0,1,2,3"' --ipc=host --network=host --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/qxzg/workspace/tensorrt-llm-triton/workload:/workspace -v /home/qxzg/workspace/tensorrt-llm-triton/data:/models \
-v /home/qxzg/.cache/modelscope/hub/qwen/Qwen2-72B-Instruct:/models/hf_models/qwen2-instruct-72b \
-p 19000:9000 \
triton_tensorrtllm_backend:v0.10.0 bash
step2: set environment variable
export HF_MODEL_DIR=/models/hf_models/qwen2-instruct-72b
export TMP_CHECKPOINT_DIR=/models/trt_output/checkpoints/Qwen2-72B/
export TRT_ENGINE_DIR=/models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4
export MAX_BATCH_SIZE=1
mkdir -p ${TRT_ENGINE_DIR} ${TMP_CHECKPOINT_DIR}
step3: convert model
python /workspace/tensorrtllm_backend/tensorrt_llm/examples/qwen/convert_checkpoint.py \
--model_dir ${HF_MODEL_DIR} \
--dtype float16 \
--output_dir ${TMP_CHECKPOINT_DIR} \
--tp_size 4
step4: build engine
trtllm-build --checkpoint_dir $TMP_CHECKPOINT_DIR \
--output_dir $TRT_ENGINE_DIR \
--gemm_plugin float16 \
--strongly_typed \
--paged_kv_cache enable \
--remove_input_padding enable \
--max_input_len=16384 \
--max_output_len=512 \
--max_batch_size 1 \
--tp_size=4
The build produces the following engine directory:
root@ubuntu-H3C-UniServer-R4900-G5:/app# ll /models/trt_output/engines/Qwen2-72B/trt_engines/fp16/tp4/
total 149414208
drwxr-xr-x 2 root root 4096 Jul 5 05:47 ./
drwxr-xr-x 3 root root 4096 Jul 5 03:02 ../
-rw-r--r-- 1 root root 5637 Jul 5 04:59 config.json
-rw-r--r-- 1 root root 38250028580 Jul 5 05:00 rank0.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:03 rank1.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:06 rank2.engine
-rw-r--r-- 1 root root 38250028580 Jul 5 05:09 rank3.engine
step5: create model repository
cp -rf /workspace/tensorrtllm_backend/all_models/inflight_batcher_llm/ /app/qwen_ifb
cp -rf /workspace/tensorrtllm_backend/tools/fill_template.py /app/
python3 /app/fill_template.py -i /app/qwen_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_DIR},tokenizer_type:auto,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
python3 /app/fill_template.py -i /app/qwen_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_DIR},tokenizer_type:auto,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
python3 /app/fill_template.py -i /app/qwen_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 /app/fill_template.py -i /app/qwen_ifb/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
python3 /app/fill_template.py -i /app/qwen_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${TRT_ENGINE_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
step6: launch triton server
python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=/app/qwen_ifb/ --http_port 18000 --grpc_port 18001 --metrics_port 18002 --log --log-file ./triton_log.txt
Expected behavior
The Triton server starts successfully.
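A successful start can be confirmed by polling Triton's HTTP readiness route. The following is a minimal sketch, assuming the HTTP port 18000 passed in step 6 and the standard `/v2/health/ready` endpoint; the helper name `server_is_ready` is hypothetical and not part of any Triton tooling.

```python
import urllib.error
import urllib.request


def server_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx replies all count as "not ready".
        return False
```

Calling `server_is_ready("http://127.0.0.1:18000")` from inside the container should return `True` once the server has started, and `False` while it is down.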
Actual behavior
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 4, rank: 2
[TensorRT-LLM][INFO] MPI size: 4, rank: 3
[TensorRT-LLM][INFO] MPI size: 4, rank: 0
[TensorRT-LLM][INFO] MPI size: 4, rank: 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Rank 3 is using GPU 3
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 2 is using GPU 2
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2560
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 36478 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 2
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 2
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] TP group is intra-node for rank 3
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 1312.07 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 36468 (MiB)
[TensorRT-LLM][WARNING] Both freeGpuMemoryFraction (aka kv_cache_free_gpu_mem_fraction) and maxTokens (aka max_tokens_in_paged_kv_cache) are set (to 0.500000 and 2560, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 40
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 2560. Allocating 209715200 bytes.
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
I0705 08:07:40.119730 79387 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.119915 79385 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.120172 79387 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0705 08:07:40.120320 79385 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0705 08:07:40.120417 79387 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0705 08:07:40.120489 79385 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0705 08:07:40.120495 79387 server.cc:677]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1 | READY |
+--------------+---------+--------+
I0705 08:07:40.120545 79385 server.cc:677]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1 | READY |
+--------------+---------+--------+
I0705 08:07:40.121322 79386 model_lifecycle.cc:835] successfully loaded 'tensorrt_llm'
I0705 08:07:40.121743 79386 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0705 08:07:40.121919 79386 server.cc:634]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0705 08:07:40.122013 79386 server.cc:677]
+--------------+---------+--------+
| Model | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1 | READY |
+--------------+---------+--------+
I0705 08:07:40.179571 79387 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.179605 79387 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.179612 79387 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.179618 79387 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.192136 79386 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.192164 79386 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.192172 79386 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.192178 79386 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.199299 79387 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.199513 79387 tritonserver.cc:2538]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /app/qwen_ifb/ |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | tensorrt_llm |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0705 08:07:40.199965 79385 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA L20
I0705 08:07:40.200012 79385 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA L20
I0705 08:07:40.200023 79385 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA L20
I0705 08:07:40.200033 79385 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA L20
I0705 08:07:40.201625 79387 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:18001
I0705 08:07:40.201876 79387 http_server.cc:4636] Started HTTPService at 0.0.0.0:18000
I0705 08:07:40.217713 79386 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.217959 79386 tritonserver.cc:2538]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /app/qwen_ifb/ |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | tensorrt_llm |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
E0705 08:07:40.219476055 79386 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:18001' {created_time:"2024-07-05T08:07:40.219384819+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-07-05T08:07:40.219362598+00:00", children:[UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.21932451+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.21928166+00:00"}]}, UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.219357544+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.219352483+00:00"}]}]}]}
E0705 08:07:40.219789 79386 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
I0705 08:07:40.229760 79385 metrics.cc:770] Collecting CPU metrics
I0705 08:07:40.230178 79385 tritonserver.cc:2538]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.44.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /app/qwen_ifb/ |
| model_control_mode | MODE_EXPLICIT |
| startup_models_0 | tensorrt_llm |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
E0705 08:07:40.232606618 79385 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:18001' {created_time:"2024-07-05T08:07:40.232429669+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-07-05T08:07:40.232381555+00:00", children:[UNKNOWN:Unable to configure socket {fd:180, created_time:"2024-07-05T08:07:40.232300483+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.232221534+00:00"}]}, UNKNOWN:Unable to configure socket {created_time:"2024-07-05T08:07:40.232372083+00:00", fd:180, children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-07-05T08:07:40.232360498+00:00"}]}]}]}
E0705 08:07:40.232992 79385 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:18001' already in use
I0705 08:07:40.243374 79387 http_server.cc:320] Started Metrics Service at 0.0.0.0:18002
Cleaning up...
Cleaning up...
Cleaning up...
error: creating server: Internal - failed to load all models
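The KV cache figures reported earlier in the log ("Max tokens in paged KV cache: 2560. Allocating 209715200 bytes." and "Max KV cache pages per sequence: 40") are consistent with a simple per-rank size calculation, which suggests the cache allocation itself is not the problem. The sketch below assumes the published Qwen2-72B shape (80 layers, 8 key/value heads, head dimension 128) and 2-byte float16 cache entries; these model parameters do not appear in the log, so treat them as assumptions. The function name is hypothetical.

```python
def kv_cache_bytes_per_rank(max_tokens, num_layers, num_kv_heads,
                            head_dim, dtype_bytes, tp_size):
    """Estimate the paged KV cache size on one tensor-parallel rank."""
    # Tensor parallelism splits the KV heads evenly across ranks.
    kv_heads_per_rank = num_kv_heads // tp_size
    # Factor 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * num_layers * kv_heads_per_rank * head_dim * dtype_bytes
    return max_tokens * bytes_per_token


# Assumed Qwen2-72B shape: 80 layers, 8 KV heads, head_dim 128, float16, tp_size 4.
total_bytes = kv_cache_bytes_per_rank(2560, 80, 8, 128, 2, 4)
print(total_bytes)  # 209715200, the allocation size reported in the log

# 2560 tokens spread over 40 pages implies 64 tokens per page.
tokens_per_page = 2560 // 40
print(tokens_per_page)  # 64
```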
Additional notes
Before launching the Triton server, no process in the container was using ports 18000, 18001, or 18002.
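One way to verify that the ports are free is to try binding each of them just before the launch. This is a minimal sketch using only the Python standard library; the helper names `port_is_free` and `busy_ports` are hypothetical. The result is only a snapshot: a port that is free at check time can still be taken a moment later, for example by another server process started by the same launch.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # "Address already in use" (errno 98 on Linux) lands here.
            return False
    return True


def busy_ports(ports, host: str = "0.0.0.0"):
    """Return the subset of `ports` that cannot be bound."""
    return [p for p in ports if not port_is_free(p, host)]
```

With nothing listening, `busy_ports([18000, 18001, 18002])` returns an empty list.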