Description
System Info
Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
version 0.13.0.dev2024081300 (the same issue with an earlier 0.12.0.dev... version)
Driver Version: 545.23.08 CUDA Version: 12.3 (the same issue for Driver Version: 555.42.06 and CUDA Version: 12.5)
H100-PCI (the same issue for H100-SXM)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
We compiled various 7B / 8B models and attempted to launch them, but when starting the Triton server we always get:
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I0821 08:15:32.609876 2817 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
Floating point exception (core dumped)
root@honzap-ct121:/mnt/ng/triton# I0821 08:15:33.549090 2830 pb_stub.cc:2049] Non-graceful termination detected.
I0821 08:15:33.559462 2831 pb_stub.cc:2049] Non-graceful termination detected.
Expected behavior
The Triton server would load and start serving requests.
actual behavior
I0821 08:15:32.609876 2817 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
Floating point exception (core dumped)
root@honzap-ct121:/mnt/ng/triton# I0821 08:15:33.549090 2830 pb_stub.cc:2049] Non-graceful termination detected.
I0821 08:15:33.559462 2831 pb_stub.cc:2049] Non-graceful termination detected.
additional notes
Everything ran fine up to the tagged version 0.11.0.
The issue seems to be specific to the TensorRT-LLM backend, as the following script, which loads the very same TensorRT-LLM engine, runs fine:
from tensorrt_llm.runtime import ModelRunner, SamplingConfig
from transformers import AutoTokenizer

tokenizer_path = "{SOME_PATH}"
dir_path = "{SOME_PATH}"

# Load the tokenizer and the compiled TensorRT-LLM engine
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
runner = ModelRunner.from_dir(dir_path)

# Use the EOS token for both padding and end-of-sequence
sc = SamplingConfig(pad_id=tokenizer.eos_token_id, end_id=tokenizer.eos_token_id)

# Tokenize a single prompt and generate a completion
tokenized = [tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"][0]]
res = runner.generate(batch_input_ids=tokenized, sampling_config=sc)
print(tokenizer.decode(res[0][0], skip_special_tokens=True))