unable to launch model with tensorrt_llm #579

@janpetrov

Description

System Info

Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
version 0.13.0.dev2024081300 (the same issue occurs with an earlier 0.12.0dev build)
Driver Version: 545.23.08 CUDA Version: 12.3 (the same issue for Driver Version: 555.42.06 and CUDA Version: 12.5)
H100-PCI (the same issue for H100-SXM)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

We compiled and attempted to launch various 7b / 8b models, but when starting the Triton server we always get:

[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I0821 08:15:32.609876 2817 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
Floating point exception (core dumped)
root@honzap-ct121:/mnt/ng/triton# I0821 08:15:33.549090 2830 pb_stub.cc:2049] Non-graceful termination detected.
I0821 08:15:33.559462 2831 pb_stub.cc:2049] Non-graceful termination detected.

Expected behavior

The Triton server loads and starts serving requests.

Actual behavior

I0821 08:15:32.609876 2817 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
Floating point exception (core dumped)
root@honzap-ct121:/mnt/ng/triton# I0821 08:15:33.549090 2830 pb_stub.cc:2049] Non-graceful termination detected.
I0821 08:15:33.559462 2831 pb_stub.cc:2049] Non-graceful termination detected.

Additional notes

Everything ran fine up through the tagged version 0.11.0.

The issue seems to be specific to the TensorRT-LLM backend, as the following script, which loads the very same engine, runs fine:

from tensorrt_llm.runtime import ModelRunner, SamplingConfig
from transformers import AutoTokenizer

tokenizer_path = "{SOME_PATH}"  # HF tokenizer directory
dir_path = "{SOME_PATH}"        # directory with the compiled TRT-LLM engine

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
runner = ModelRunner.from_dir(dir_path)
sc = SamplingConfig(pad_id=tokenizer.eos_token_id, end_id=tokenizer.eos_token_id)

# Tokenize a single prompt and generate with the same engine that
# crashes under the Triton backend.
tokenized = [tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"][0]]
res = runner.generate(batch_input_ids=tokenized, sampling_config=sc)
print(tokenizer.decode(res[0][0], skip_special_tokens=True))
