Significant latency between COMPUTE_END and REQUEST_END #6923

Closed
danielchalef opened this issue Feb 26, 2024 · 2 comments

danielchalef commented Feb 26, 2024

Description
We're seeing significant latency, on the order of 300-600 milliseconds, between COMPUTE_END and REQUEST_END on a TensorRT-LLM model. See the OTEL trace image below.

[OpenTelemetry trace screenshot showing the gap between COMPUTE_END and REQUEST_END]

Triton Information
What version of Triton are you using? 2.42.0

Are you using the Triton container or did you build it yourself? NGC Container 24.01

To Reproduce

  • Mistral-7B compiled to a TensorRT-LLM engine on an A100 with 2x tensor parallel.
  • TensorRT-LLM backend version 0.7.1 (as shipped in NGC Container)
  • Deployed to a 2 GPU A100-80 server.
  • Dynamic Batching
  • Redis Response cache
  • We use the gRPC API

See the config.pbtxt below.
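
For reference, requests are sent through the Triton gRPC client roughly as in the sketch below (the model name, token IDs, and output length are illustrative placeholders; the real client tokenizes the prompt separately):

```python
# Minimal gRPC request sketch. Token IDs, output length, and the model
# name are placeholders; assumes Triton's default gRPC port 8001.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

input_ids = np.array([[1, 7423, 338, 263]], dtype=np.int32)        # [batch, seq_len]
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.int32)   # [batch, 1]
request_output_len = np.array([[128]], dtype=np.int32)             # [batch, 1]

inputs = []
for name, data in (("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)):
    tensor = grpcclient.InferInput(name, list(data.shape), "INT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(
    model_name="XXXXX",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("output_ids")],
)
print(result.as_numpy("output_ids"))
```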

Expected behavior
REQUEST_END should occur very shortly after COMPUTE_END, perhaps on the order of tens of ms.
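
Beyond the trace, Triton's per-model inference statistics break the request lifetime into queue, compute, and (when the response cache is enabled) cache-hit/miss phases, which can help show where the extra time is spent. A minimal sketch, assuming the gRPC statistics API on the default port; some client versions return camelCase JSON keys instead of the snake_case names used here:

```python
# Fetch per-model statistics and print the average duration of each
# request phase. Assumes the default gRPC port 8001 and the model name
# used in this issue; the cache_* entries only appear when the response
# cache is enabled.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
stats = client.get_inference_statistics(model_name="XXXXX", as_json=True)

for model in stats.get("model_stats", []):
    infer_stats = model.get("inference_stats", {})
    for phase in ("queue", "compute_input", "compute_infer",
                  "compute_output", "cache_hit", "cache_miss"):
        entry = infer_stats.get(phase, {})
        count = int(entry.get("count", 0))
        total_ns = int(entry.get("ns", 0))
        avg_ms = total_ns / count / 1e6 if count else 0.0
        print(f"{phase:>15}: count={count:6d} avg={avg_ms:8.2f} ms")
```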

name: "XXXXX"
backend: "tensorrtllm"
max_batch_size: 8

version_policy: { latest: { num_versions: 1}}

model_transaction_policy {
  decoupled: false
}

dynamic_batching {
    max_queue_delay_microseconds: 25000
    priority_levels: 3
    default_priority_level: 2
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]  
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]

parameters: {
  key: "max_beam_width"
  value: {
    string_value: "1"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/models/XXXXXXX/1"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: ""
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
parameters: {
  key: "max_num_sequences"
  value: {
    string_value: "16"
  }
}
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "true"
  }
}

parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}

parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "false"
  }
}

instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]

response_cache {
  enable: true
}
```
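
Since the response cache is enabled above and cache insertion happens after compute, the Prometheus metrics endpoint is one place to check whether cache time accounts for part of the gap. A small sketch, assuming the default metrics port 8002; it simply filters for cache-related metric lines rather than assuming exact metric names:

```python
# Scrape Triton's Prometheus metrics endpoint and print any cache-related
# lines (hit/miss counts and durations). Assumes the default port 8002.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if "cache" in line.lower() and not line.startswith("#"):
        print(line)
```
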
@fpetrini15 (Contributor) commented:

Hi @danielchalef,

Is this unusual compared to other models you have tested? Is this latency consistently present across all test runs with this model and with other TRT-LLM models?

@fpetrini15 (Contributor) commented:

Closing this issue due to inactivity. Please re-open it if you would like to follow up.
