Error: terminate called after throwing an instance of 'boost::interprocess::lock_exception'  #521

@Pedrochem

System Info

Environment

  • CPU architecture: x86_64
  • GPUs: 4x 80gb H100
  • TensorRT-LLM backend v0.10.0
  • Docker image: triton_trt_llm (built using docker, from TensorRT-LLM backend option 2)
  • Nvidia driver: 535.183.01
  • OS: Ubuntu 22.04

Who can help?

@byshiue @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

First cloned the repo and built the TensorRT-LLM backend image following option 2.

Also git-cloned Mixtral-8x7B-Instruct-v0.1 into /data.

Ran the image with:

docker run -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend -v /data:/data triton_trt_llm bash

Installed TensorRT-LLM inside the container to generate the model engines:

cd tensorrt_llm &&
bash docker/common/install_cmake.sh &&
export PATH=/usr/local/cmake/bin:$PATH &&
python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
pip3 install ./build/tensorrt_llm*.whl

Generated checkpoints and built engines with tp_size=4:

python tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /data/Mixtral-8x7B-Instruct-v0.1 \
                             --output_dir /data/checkpoint_mixtral_tp_4 \
                             --dtype float16 \
                             --tp_size 4

trtllm-build --checkpoint_dir /data/checkpoint_mixtral_tp_4 \
                 --output_dir /data/build_mixtral_tp_4 \
                 --gemm_plugin float16 \
                 --max_batch_size 10
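
Since the engines are built with tp_size=4 and the server is later launched with --world_size=4, it can help to confirm that the engine directory holds one engine per rank before starting Triton. The sketch below assumes the build writes files named `rank<N>.engine` into the output directory; that naming is an assumption about this TensorRT-LLM version, and `missing_rank_engines` is a hypothetical helper, not part of any tool.

```python
from pathlib import Path


def missing_rank_engines(engine_dir, world_size):
    """Return the expected per-rank engine file names that are absent.

    Assumes one file per rank named rank<N>.engine (an assumption about
    the build output layout, not a documented guarantee).
    """
    directory = Path(engine_dir)
    expected = [f"rank{rank}.engine" for rank in range(world_size)]
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # Paths taken from the commands in this report.
    missing = missing_rank_engines("/data/build_mixtral_tp_4", 4)
    print("missing engines:", missing if missing else "none")
```

An empty list means every rank has an engine file under the assumed naming; anything else points at an incomplete build or a wrong directory.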

Set up the model repository:
cp all_models/inflight_batcher_llm/ triton_model_repo -r
cp /data/build_mixtral_tp_4/* triton_model_repo/1

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:/data/Mixtral-8x7B-Instruct-v0.1,triton_max_batch_size:10,preprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/data/Mixtral-8x7B-Instruct-v0.1,triton_max_batch_size:10,postprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:10,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:10
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:10,decoupled_mode:False,max_beam_width:1,engine_dir:/data/build_mixtral_tp_4,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
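
Each `fill_template.py` call above takes a comma-separated list of `key:value` pairs and writes them into the `${key}` placeholders of the given `config.pbtxt`. The following is a minimal Python sketch of that kind of substitution, useful for spotting placeholders that a given argument list would leave unfilled. It is an illustration written for this report, not the tool's source; the helper names and the sample text are hypothetical.

```python
import re
from string import Template


def parse_pairs(spec):
    """Split 'k1:v1,k2:v2' into a dict; only the first ':' separates key and value."""
    pairs = {}
    for item in spec.split(","):
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def fill(text, spec):
    """Substitute ${key} placeholders, leaving unknown placeholders untouched."""
    return Template(text).safe_substitute(parse_pairs(spec))


def unfilled_placeholders(text):
    """Return the names of any ${...} placeholders still present in the text."""
    return re.findall(r"\$\{(\w+)\}", text)


# Made-up fragment in the style of a config.pbtxt template.
sample = (
    "max_batch_size: ${triton_max_batch_size}\n"
    'string_value: "${engine_dir}"\n'
    'string_value: "${gpu_device_ids}"\n'
)
filled = fill(sample, "triton_max_batch_size:10,engine_dir:/data/build_mixtral_tp_4")
print(filled)
print(unfilled_placeholders(filled))  # -> ['gpu_device_ids']
```

Running a check like `unfilled_placeholders` over each generated `config.pbtxt` shows which parameters were not covered by the arguments passed on the command line.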

Launched Triton server:

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo

Expected behavior

Expected to see server working with:

I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
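
Besides watching for the log lines above, readiness can be checked by polling Triton's HTTP health endpoint, `/v2/health/ready`, on the HTTP port (8000 here). A minimal sketch using only the Python standard library; `triton_ready` is a hypothetical helper name, and the host and port defaults are assumptions matching the expected log output:

```python
import http.client
import urllib.error
import urllib.request


def triton_ready(host="localhost", port=8000, timeout=2.0):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    url = f"http://{host}:{port}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False


if __name__ == "__main__":
    print("ready" if triton_ready() else "not ready")
```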

Actual behavior

Couldn't launch Triton server.
Error:

[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
[h100-2:00149] *** Process received signal ***
[h100-2:00149] Signal: Aborted (6)
[h100-2:00149] Signal code:  (-6)
  what():  boost::interprocess::lock_exception
[h100-2:00149] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f71a871e520]
[h100-2:00149] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f71a87729fc]
[h100-2:00149] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f71a871e476]
[h100-2:00149] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f71a87047f3]
[h100-2:00149] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f71a89a7b9e]
[h100-2:00149] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f71a89b320c]
[h100-2:00149] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f71a89b21e9]
[h100-2:00149] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f71a89b2959]
[h100-2:00149] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f71aa7f6884]
[h100-2:00149] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f71aa7f6f41]
[h100-2:00149] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f71a89b34cb]
[h100-2:00149] [11] /opt/tritonserver/backends/python/libtriton_python.so(+0x87bfa)[0x7f719c12cbfa]
[h100-2:00149] [12] /opt/tritonserver/backends/python/libtriton_python.so(+0x7800c)[0x7f719c11d00c]
[h100-2:00149] [13] /opt/tritonserver/backends/python/libtriton_python.so(+0x7ed06)[0x7f719c123d06]
[h100-2:00149] [14] /opt/tritonserver/backends/python/libtriton_python.so(+0x9930a)[0x7f719c13e30a]
[h100-2:00149] [15] /opt/tritonserver/backends/python/libtriton_python.so(+0x853b3)[0x7f719c12a3b3]
[h100-2:00149] [16] /opt/tritonserver/backends/python/libtriton_python.so(+0x3c4c4)[0x7f719c0e14c4]
[h100-2:00149] [17] /opt/tritonserver/backends/python/libtriton_python.so(TRITONBACKEND_ModelInstanceInitialize+0x4ec)[0x7f719c0e1d0c]
[h100-2:00149] [18] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af096)[0x7f71a9124096]
[h100-2:00149] [19] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02d6)[0x7f71a91252d6]
[h100-2:00149] [20] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928e5)[0x7f71a91078e5]
[h100-2:00149] [21] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f26)[0x7f71a9107f26]
[h100-2:00149] [22] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f81d)[0x7f71a911481d]
[h100-2:00149] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7f71a8775ee8]
[h100-2:00149] [24] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18afee)[0x7f71a90fffee]
[h100-2:00149] [25] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f71a89e1253]
[h100-2:00149] [26] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f71a8770ac3]
[h100-2:00149] [27] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f71a8801a04]
[h100-2:00149] *** End of error message ***
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
  what():  boost::interprocess::lock_exception
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node h100-2 exited on signal 6 (Aborted).
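
The trace shows the exception being thrown from `libtriton_python.so` inside `TRITONBACKEND_ModelInstanceInitialize`, that is, while a Python backend model instance is being initialized. The Python backend uses boost::interprocess shared memory to communicate with its stub processes, so one thing that may be worth ruling out (an assumption, not a confirmed cause) is how much shared memory is available in the container, which was started with `--shm-size=2g`. A minimal sketch that reports `/dev/shm` usage; `shm_usage_gib` is a hypothetical helper name:

```python
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return (total, used, free) for the filesystem at `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


if __name__ == "__main__":
    try:
        total, used, free = shm_usage_gib()
        print(f"/dev/shm: total={total:.2f} GiB used={used:.2f} GiB free={free:.2f} GiB")
    except FileNotFoundError:
        print("/dev/shm is not present on this system")
```

Running this inside the container before and after a failed launch shows whether the shared-memory region is close to full or still holds regions left over from an earlier run.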

Additional notes

I also tried building and serving the model with other images, but they didn't work either.
Any help would be much appreciated.

Labels: bug (Something isn't working)
