v0.11.0 release fails when TP>1  #564

@daulet

Description

System Info

  • CPU: x86_64
  • GPUs: 8x H100 80GB HBM3
  • Driver: 550.90.07
  • CUDA: 12.4
  • TensorRT-LLM: v0.11.0
  • tensorrtllm_backend: v0.11.0

Who can help?

@kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build an engine with TP>1 (so far I've tried gpt2, llama3 8b, and llama3.1 8b with TP2 and TP4) and start Triton with the inflight-batching config. The issue seems to be specific to TP>1 combined with this backend, because I've ruled out the following:

  • a TP=1 engine boots successfully with the same config (I only had to change gpu_device_ids);
  • a TP>1 engine can be loaded and generates output using the TensorRT-LLM examples/run.py script;

Hence it is likely an issue with tensorrtllm_backend + MPI.
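For reference, the reproduction steps above roughly correspond to the commands below (a sketch based on the v0.11 docs; the model path, output directories, and Triton model repo are placeholders, and exact flag values should be checked against your checkout):

```shell
# Convert the HF checkpoint with tensor parallelism (TP=2 here).
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama3-8b-hf \
    --output_dir ./ckpt_tp2 \
    --dtype float16 \
    --tp_size 2

# Build the TP=2 engine (one rank per GPU).
trtllm-build \
    --checkpoint_dir ./ckpt_tp2 \
    --output_dir ./engine_tp2 \
    --gemm_plugin float16

# Launch Triton; world_size must match tp_size * pp_size.
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo ./triton_model_repo
```

With world_size=1 and a TP=1 engine the same launch succeeds; the MPI_ABORT below appears only when world_size > 1.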

Expected behavior

The server should boot up successfully; this is a supported model, an official release, and no customizations.

Actual behavior

[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed mMaxSequenceLen. Therefore, it has been adjusted to match the value of mMaxSequenceLen.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3849 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3845 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[192-222-52-240:00289] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Triton logs:

I0812 04:47:11.717997 43252 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f9500000000' with size 268435456"
I0812 04:47:11.757600 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0812 04:47:11.757689 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0812 04:47:11.757760 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0812 04:47:11.760211 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0812 04:47:13.811956 43252 model_lifecycle.cc:472] "loading: tensorrt_llm:1"

additional notes

It seems TP is simply broken in the v0.11 release, since I've set everything up according to the documented steps.
