System Info
- CPU: x86_64
- GPUs: 8x H100 80GB HBM3
- Driver: 550.90.07
- CUDA: 12.4
- TensorRT-LLM: v0.11.0
- tensorrtllm_backend: v0.11.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Build an engine with TP>1 (I've tried GPT-2, Llama 3 8B, and Llama 3.1 8B with TP2 and TP4 so far) and start Triton with the inflight-batching config (a rough sketch of the commands is at the end of this section). The issue seems to be specific to TP>1 and this backend, because I've ruled out:
- TP=1 with a corresponding engine boots successfully with the same config (only gpu_device_ids had to be changed);
- a TP>1 engine can be loaded and generates output using the TensorRT-LLM examples/run.py script.
Hence it is likely an issue with tensorrtllm_backend + MPI.
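For reference, the build and launch steps were roughly the following; the paths, model directories, and exact flag values below are illustrative placeholders rather than my exact invocation, following the v0.11 Llama example and the inflight-batcher documentation:

```bash
# Convert the HF checkpoint for tensor parallelism 2 (Llama example from TensorRT-LLM v0.11)
python3 examples/llama/convert_checkpoint.py \
    --model_dir /models/llama-3.1-8b \
    --output_dir /models/llama-3.1-8b-ckpt-tp2 \
    --dtype float16 \
    --tp_size 2

# Build the TP2 engine (batch size / sequence length match the values in the logs below)
trtllm-build \
    --checkpoint_dir /models/llama-3.1-8b-ckpt-tp2 \
    --output_dir /models/llama-3.1-8b-engine-tp2 \
    --gemm_plugin float16 \
    --max_batch_size 64 \
    --max_seq_len 4096

# Launch Triton with one MPI rank per TP rank (world_size must match tp_size)
python3 scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /triton_model_repo
```

The same model repo with a TP1 engine and --world_size 1 loads fine; with world_size 2 or 4 it aborts as shown below.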
Expected behavior
The server should boot up successfully: this is a supported model, an official release, and no customizations.
Actual behavior
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed mMaxSequenceLen. Therefore, it has been adjusted to match the value of mMaxSequenceLen.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 4096
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3849 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3845 MiB
[TensorRT-LLM][INFO] Loaded engine size: 3844 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[192-222-52-240:00289] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[192-222-52-240:00289] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Triton logs:
I0812 04:47:11.717997 43252 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f9500000000' with size 268435456"
I0812 04:47:11.757600 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0812 04:47:11.757689 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0812 04:47:11.757760 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0812 04:47:11.760211 43252 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0812 04:47:13.811956 43252 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
Additional notes
TP appears to be simply broken in the v0.11 release, since I've set up everything according to the documented steps.