System Info
- CPU: x86_64
- GPUs: 8x H100 80GB HBM3
- Driver: 550.90.07
- CUDA: 12.4
- TensorRT-LLM: v0.11.0
- tensorrtllm_backend: v0.11.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Build an engine with TP>1 (I've tried GPT-2, Llama 3 8B, and Llama 3.1 8B with TP2 and TP4 so far) and start Triton with the inflight-batching config (a rough sketch of the commands is at the end of this section). The issue seems to be specific to TP>1 and this backend, because I've ruled out:
- TP=1 with a corresponding engine boots successfully with the same config (only gpu_device_ids had to be changed);
- a TP>1 engine can be loaded and generates output using the TensorRT-LLM examples/run.py script.
Hence it is likely an issue with tensorrtllm_backend + MPI.
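For reference, the build and launch steps were roughly the following; the paths, model directories, and exact flag values below are illustrative placeholders rather than my exact invocation, following the v0.11 Llama example and the inflight-batcher documentation:

```bash
# Convert the HF checkpoint for tensor parallelism 2 (Llama example from TensorRT-LLM v0.11)
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-8b \
    --output_dir /models/llama-3.1-8b-ckpt-tp2 \
    --dtype float16 \
    --tp_size 2

# Build the TP2 engine (batch size / sequence length match the values in the logs below)
trtllm-build \
    --checkpoint_dir /models/llama-3.1-8b-ckpt-tp2 \
    --output_dir /models/llama-3.1-8b-engine-tp2 \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_seq_len 4096

# Launch Triton with one MPI rank per TP rank (world_size must match tp_size)
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /triton_model_repo
```

The same model repo with a TP1 engine and --world_size 1 loads fine; with world_size 2 or 4 it aborts as shown below.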
Expected behavior
The server should boot up successfully: this is a supported model, an official release, and no customizations.
Actual behavior
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed mMaxSequenceLen. Therefore, it has been adjusted to match the value of mMaxSequenceLen.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3849 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3845 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[192-222-52-240:00289] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Triton logs:
I0812 04:47:11.717997 43252 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f9500000000' with size 268435456"
I0812 04:47:11.757600 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0812 04:47:11.757689 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0812 04:47:11.757760 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0812 04:47:11.760211 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0812 04:47:13.811956 43252 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
Additional notes
TP appears to be simply broken in the v0.11 release, since I've set up everything according to the documented steps.