
Assertion failed: Cannot determine if hopper is specialised without a selected config at runtime (Latest commit) #529

@christian-ci

Description


System Info

  • x86_64
  • NVIDIA L4
  • Built TensorRT-LLM backend using:
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
  • TensorRT-LLM version: 0.12.0.dev2024070900
    with Commit 6053a5dcf49a51174bcb28b598856c421e896e29
  • Engine built:
trtllm-build --checkpoint_dir /models/mixtral-8x7b-instruct-fp8-tp4-quant \
             --output_dir /models/mixtral-8x7b-instruct-engine-fp8-tp4 \
             --use_custom_all_reduce disable \
             --use_fp8_context_fmha disable \
             --workers 4

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build the image using latest commit 6053a5dcf49a51174bcb28b598856c421e896e29 on 8x NVIDIA L4 GPUs (AWS g6.48xlarge), using the Ubuntu Deep Learning GPU AMI (the NVIDIA Deep Learning AMI has no support for G6 instances):

DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Quantize mixtral-8x7b-instruct to fp8 and then build engine:

trtllm-build --checkpoint_dir /models/mixtral-8x7b-instruct-fp8-tp4-quant \
             --output_dir /models/mixtral-8x7b-instruct-engine-fp8-tp4 \
             --use_custom_all_reduce disable \
             --use_fp8_context_fmha disable \
             --workers 4
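
For reference, the fp8 TP4 checkpoint consumed by `trtllm-build` above was presumably produced with TensorRT-LLM's quantization example script; the sketch below assumes the `examples/quantization` layout of the repo, and the script path, input path, and flags are assumptions, not taken from the original report:

```bash
# Sketch only: produces the fp8 TP4 checkpoint that trtllm-build consumes above.
# Script location and paths are assumptions; adjust to your checkout.
python3 examples/quantization/quantize.py \
    --model_dir /models/mixtral-8x7b-instruct \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --tp_size 4 \
    --output_dir /models/mixtral-8x7b-instruct-fp8-tp4-quant
```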

Adjust the config files as needed: decoupled mode, the tokenizer, and inflight_fused_batching with the max_utilization scheduler policy.
Load the model and run it.
Then try to run inference.
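
The config adjustments described above likely correspond to settings such as the following in the backend's `tensorrt_llm/config.pbtxt` template. This is a sketch using parameter names from the tensorrtllm_backend templates; the exact values shown are assumptions, not from the report:

```protobuf
# tensorrt_llm/config.pbtxt (excerpt) -- values are illustrative
model_transaction_policy {
  decoupled: true
}
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "max_utilization" }
}
```

Loading the model would then typically be done with the backend's launch helper, e.g. `python3 scripts/launch_triton_server.py --world_size 4 --model_repo <repo>`, with `--world_size` matching the TP=4 engine.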

Expected behavior

Inference runs successfully.

Actual behavior

I0709 16:10:56.837763 111 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: Cannot determine if hopper is specialised without a selected config (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_template.h:427)
1 0x71c34c806661 /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10(+0x80661) [0x71c34c806661]
2 0x71c34d1e0480 /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10(+0xa5a480) [0x71c34d1e0480]
3 0x71c34d1e04e1 /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10(+0xa5a4e1) [0x71c34d1e04e1]
4 0x71c34ca0013f /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10(+0x27a13f) [0x71c34ca0013f]
5 0x71c34ca1417d /app/tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.10(+0x28e17d) [0x71c34ca1417d]
6 0x71c34c8da42d tensorrt_llm::plugins::MixtureOfExpertsPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 925
7 0x71c319fcfa8c /usr/local/tensorrt/lib/libnvinfer.so.10(+0x109fa8c) [0x71c319fcfa8c]
8 0x71c319f74657 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1044657) [0x71c319f74657]
9 0x71c319f760c1 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10460c1) [0x71c319f760c1]
10 0x71c376f4e9d4 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
11 0x71c376f5c72b tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 235
12 0x71c376f5c896 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 246
13 0x71c376f5d219 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2409
14 0x71c376f7d9fb tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 411
15 0x71c376f8124d tensorrt_llm::executor::Executor::Impl::executionLoop() + 301
16 0x71c517bdb253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x71c517bdb253]
17 0x71c51796aac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x71c51796aac3]
18 0x71c5179fba04 clone + 68
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'

Additional notes

None

Labels: bug