Description
System Info
Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
version 0.13.0.dev2024081300 (the same issue with an earlier 0.12.0.dev... version)
Driver Version: 545.23.08 CUDA Version: 12.3 (the same issue for Driver Version: 555.42.06 and CUDA Version: 12.5)
H100-PCI (the same issue for H100-SXM)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
We compiled various 7B / 8B models and attempted to launch them, but when starting the Triton server we always get:
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
I0821 08:15:32.609876 2817 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
Floating point exception (core dumped)
root@honzap-ct121:/mnt/ng/triton# I0821 08:15:33.549090 2830 pb_stub.cc:2049] Non-graceful termination detected.
I0821 08:15:33.559462 2831 pb_stub.cc:2049] Non-graceful termination detected.
Expected behavior
The Triton server would load and start serving requests.
actual behavior
I0821 08:15:32.609876 2817 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
Floating point exception (core dumped)
root@honzap-ct121:/mnt/ng/triton# I0821 08:15:33.549090 2830 pb_stub.cc:2049] Non-graceful termination detected.
I0821 08:15:33.559462 2831 pb_stub.cc:2049] Non-graceful termination detected.
additional notes
Everything ran fine up to the tagged version 0.11.0.
The issue seems to be specific to the TensorRT-LLM backend, as the following script, which loads the very same TensorRT-LLM engine, runs fine:
from tensorrt_llm.runtime import ModelRunner, SamplingConfig
from transformers import AutoTokenizer

tokenizer_path = "{SOME_PATH}"
dir_path = "{SOME_PATH}"

# Load the tokenizer and the compiled TensorRT-LLM engine
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
runner = ModelRunner.from_dir(dir_path)

# Use the EOS token for both padding and end-of-sequence
sc = SamplingConfig(pad_id=tokenizer.eos_token_id, end_id=tokenizer.eos_token_id)

# Tokenize a single prompt and generate a completion
tokenized = [tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"][0]]
res = runner.generate(batch_input_ids=tokenized, sampling_config=sc)
print(tokenizer.decode(res[0][0], skip_special_tokens=True))