3rd Tritonserver fails to respond  #509

@njaramish

Description

System Info

8xH100 node, deploying each server inside its own Docker container

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Using TensorRT-LLM v0.10.0. The image is built from the tensorrtllm_backend repo using Docker.
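For completeness, the image build I am referring to is along these lines (a sketch from memory, not copied from my shell history; the `triton_trt_llm` tag is a placeholder, and the exact Dockerfile name should be checked against the repo README for the v0.10.0 branch):

```shell
# Sketch: build the Triton + TensorRT-LLM backend image from source.
# Tag name is a placeholder; verify the Dockerfile path in the repo README.
git clone -b v0.10.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
    -f dockerfile/Dockerfile.trt_llm_backend .
```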

  1. Compile the engines:
python convert_checkpoint.py --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
                             --output_dir {checkpoint_dir} \
                             --dtype float16 \
                             --tp_size 2

trtllm-build --checkpoint_dir {checkpoint_dir} \
             --output_dir {deploy_dir} \
             --gemm_plugin float16 \
             --workers 2 \
             --tp_size 2 \
             --pp_size 1 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --remove_input_padding enable \
             --use_custom_all_reduce disable \
             --paged_kv_cache enable \
             --use_paged_context_fmha disable \
             --max_input_len 32768 \
             --max_batch_size 10 \
             --max_output_len 1024 \
             --max_beam_width 1 \
             --max_num_tokens 65544
  2. Launch the servers with commands like those below, varying the ports and GPUs each one uses. Each command runs in a separate Docker container that is started with access to a different set of 2 GPUs:
python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8000 --grpc_port 8001 --metrics_port 8002
python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8010 --grpc_port 8011 --metrics_port 8012
python3 scripts/launch_triton_server.py --world_size 2 --model_repo={deploy_dir} --http_port 8020 --grpc_port 8021 --metrics_port 8022
  3. curl localhost:8020/v2/health/ready
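To make the container setup in step 2 concrete, the per-container GPU pinning looks roughly like this (a sketch: the `triton_trt_llm` image name, the `/models` mount path, and the specific device IDs are placeholders, not taken from my actual setup):

```shell
# Sketch: one detached container per Tritonserver, each pinned to its own
# pair of GPUs via --gpus, each exposing a distinct port triple on the
# host network. Image name and paths are hypothetical placeholders.
docker run -d --gpus '"device=0,1"' --net host -v /models:/models triton_trt_llm \
    python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo=/models/deploy --http_port 8000 --grpc_port 8001 --metrics_port 8002

docker run -d --gpus '"device=2,3"' --net host -v /models:/models triton_trt_llm \
    python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo=/models/deploy --http_port 8010 --grpc_port 8011 --metrics_port 8012

docker run -d --gpus '"device=4,5"' --net host -v /models:/models triton_trt_llm \
    python3 scripts/launch_triton_server.py --world_size 2 \
        --model_repo=/models/deploy --http_port 8020 --grpc_port 8021 --metrics_port 8022
```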

Expected behavior

Given the tutorial here, I expected to be able to run several Tritonservers with tensorrtllm_backend on a single node, provided each Tritonserver has its own allocation of GPUs and ports, and that each server would respond to requests.

Actual behavior

Launching one container running one Tritonserver works as expected, as does launching a second container running another Tritonserver (on different ports and GPUs). However, launching a 3rd container with a 3rd Tritonserver results in the 3rd Tritonserver being unresponsive:

curl localhost:8020/v2/health/ready
curl: (56) Recv failure: Connection reset by peer

Sometimes, the request goes through and a healthy response is returned. Similarly, I am sometimes able to get a response from the ensemble model, but other times I get the connection reset error.

The two Tritonservers that are spun up first always work. Starting with no containers or servers running, if I launch the 1st and 3rd port-GPU configurations first, those two Tritonservers work, while the 2nd port-GPU configuration does not.
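To quantify the intermittent behavior, I have been probing the readiness endpoint in a loop; a minimal sketch follows (the port list, attempt counts, and delays are illustrative assumptions, not part of the original repro):

```python
import time
import urllib.error
import urllib.request


def probe_ready(url, timeout=2.0):
    """Return True if the readiness endpoint answers HTTP 200, False on
    any connection error (including connection-reset)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, ConnectionResetError, OSError):
        return False


def readiness_rate(url, attempts=10, delay=1.0, probe=probe_ready):
    """Probe `url` repeatedly and return the fraction of healthy responses.

    A flaky server (sometimes ready, sometimes connection-reset) shows a
    rate strictly between 0.0 and 1.0; a consistently healthy one shows 1.0.
    """
    ok = 0
    for i in range(attempts):
        if probe(url):
            ok += 1
        if i < attempts - 1:
            time.sleep(delay)
    return ok / attempts


if __name__ == "__main__":
    # The three HTTP ports from the repro steps above.
    for port in (8000, 8010, 8020):
        rate = readiness_rate(f"http://localhost:{port}/v2/health/ready",
                              attempts=5, delay=0.5)
        print(f"port {port}: ready on {rate:.0%} of probes")
```

The `probe` parameter exists so the polling logic can be exercised without a live server.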

Additional notes

I see that the example here only launches two Tritonservers. Does tensorrtllm_backend support launching more than two Tritonservers on the same node?

Labels: bug (Something isn't working)