3rd Tritonserver fails to respond  #509

@njaramish

Description

System Info

8xH100 node, deploying each server inside its own Docker container

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Using TensorRT-LLM v0.10.0. The image is built from the tensorrtllm_backend repo using Docker.
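For completeness, the image build I am referring to is along these lines (a sketch from memory, not copied from my shell history; the `triton_trt_llm` tag is a placeholder, and the exact Dockerfile name should be checked against the repo README for the v0.10.0 branch):

```shell
# Sketch: build the Triton + TensorRT-LLM backend image from source.
# Tag name is a placeholder; verify the Dockerfile path in the repo README.
git clone -b v0.10.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
    -f dockerfile/Dockerfile.trt_llm_backend .
```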

  1. Compile the engines:
python convert_checkpoint.py --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
                             --output_dir {checkpoint_dir} \
                             --dtype float16 \
                             --tp_size 2

trtllm-build --checkpoint_dir {checkpoint_dir} \
             --output_dir {deploy_dir} \
             --gemm_plugin float16 \
             --workers 2 \
             --tp_size 2 \
             --pp_size 1 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --remove_input_padding enable \
             --use_custom_all_reduce disable \
             --paged_kv_cache enable \
             --use_paged_context_fmha disable \
             --max_input_len 32768 \
             --max_batch_size 10 \
             --max_output_len 1024 \
             --max_beam_width 1 \
             --max_num_tokens 65544
  2. Launch the servers with commands like those below, varying the ports and GPUs each one uses. Each command runs in a separate Docker container that is started with access to a different set of 2 GPUs:
python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8000 --grpc_port 8001 --metrics_port 8002
python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8010 --grpc_port 8011 --metrics_port 8012
python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8020 --grpc_port 8021 --metrics_port 8022
  3. curl localhost:8020/v2/health/ready
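To make the container setup in step 2 concrete, the per-container GPU pinning looks roughly like this (a sketch: the `triton_trt_llm` image name, the `/models` mount path, and the specific device IDs are placeholders, not taken from my actual setup):

```shell
# Sketch: one detached container per Tritonserver, each pinned to its own
# pair of GPUs via --gpus, each exposing a distinct port triple on the
# host network. Image name and paths are hypothetical placeholders.
docker run -d --gpus '"device=0,1"' --net host -v /models:/models triton_trt_llm \
    python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo=/models/deploy --http_port 8000 --grpc_port 8001 --metrics_port 8002

docker run -d --gpus '"device=2,3"' --net host -v /models:/models triton_trt_llm \
    python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo=/models/deploy --http_port 8010 --grpc_port 8011 --metrics_port 8012

docker run -d --gpus '"device=4,5"' --net host -v /models:/models triton_trt_llm \
    python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo=/models/deploy --http_port 8020 --grpc_port 8021 --metrics_port 8022
```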

Expected behavior

Given the tutorial here, I expected to be able to run several Tritonservers with tensorrtllm_backend on a single node, provided each Tritonserver has its own allocation of GPUs and ports, and that each server would respond to requests.

Actual behavior

Launching one container running one Tritonserver works as expected, as does launching a second container running another Tritonserver (on different ports and GPUs). However, launching a 3rd container with a 3rd Tritonserver results in the 3rd Tritonserver being unresponsive:

curl localhost:8020/v2/health/ready
curl: (56) Recv failure: Connection reset by peer

Sometimes, the request goes through and a healthy response is returned. Similarly, I am sometimes able to get a response from the ensemble model, but other times I get the connection reset error.

The two Tritonservers that are spun up first always work. Starting with no containers or servers running, if I launch the 1st and 3rd port-GPU configurations first, those two Tritonservers work, while the 2nd port-GPU configuration does not.
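To quantify the intermittent behavior, I have been probing the readiness endpoint in a loop; a minimal sketch follows (the port list, attempt counts, and delays are illustrative assumptions, not part of the original repro):

```python
import time
import urllib.error
import urllib.request


def probe_ready(url, timeout=2.0):
    """Return True if the readiness endpoint answers HTTP 200, False on
    any connection error (including connection-reset)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, ConnectionResetError, OSError):
        return False


def readiness_rate(url, attempts=10, delay=1.0, probe=probe_ready):
    """Probe `url` repeatedly and return the fraction of healthy responses.

    A flaky server (sometimes ready, sometimes connection-reset) shows a
    rate strictly between 0.0 and 1.0; a consistently healthy one shows 1.0.
    """
    ok = 0
    for i in range(attempts):
        if probe(url):
            ok += 1
        if i < attempts - 1:
            time.sleep(delay)
    return ok / attempts


if __name__ == "__main__":
    # The three HTTP ports from the repro steps above.
    for port in (8000, 8010, 8020):
        rate = readiness_rate(f"http://localhost:{port}/v2/health/ready",
                              attempts=5, delay=0.5)
        print(f"port {port}: ready on {rate:.0%} of probes")
```

The `probe` parameter exists so the polling logic can be exercised without a live server.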

Additional notes

I see that the example here only launches two Tritonservers. Does tensorrtllm_backend support launching more than two Tritonservers on the same node?

Labels: bug (Something isn't working)