Status: Open
Labels: bug (Something isn't working)
Description
System Info
Environment
- CPU architecture: x86_64
- GPUs: 4× 80GB H100
- TensorRT-LLM backend v0.10.0
- Docker image: triton_trt_llm (built using docker, from TensorRT-LLM backend option 2)
- Nvidia driver: 535.183.01
- OS: Ubuntu 22.04
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
First, I cloned the repo and built the TensorRT-LLM backend image according to option 2.
I also git cloned Mixtral-8x7B-Instruct-v0.1 into /data.
Executed the image with
docker run -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend -v /data:/data triton_trt_llm bash

Installed TensorRT-LLM inside the image to generate the model engines:
cd tensorrt_llm &&
bash docker/common/install_cmake.sh &&
export PATH=/usr/local/cmake/bin:$PATH &&
python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
pip3 install ./build/tensorrt_llm*.whl

Generated checkpoints and built engines with tp_size=4:
python tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir /data/Mixtral-8x7B-Instruct-v0.1 \
--output_dir /data/checkpoint_mixtral_tp_4 \
--dtype float16 \
--tp_size 4
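For context, `--tp_size 4` shards each weight matrix across the four GPUs during checkpoint conversion. A minimal sketch of column-wise tensor-parallel sharding (illustrative only; `shard_columns` is a hypothetical name, not TensorRT-LLM's actual conversion code):

```python
import numpy as np

def shard_columns(weight: np.ndarray, tp_size: int) -> list[np.ndarray]:
    """Split a weight matrix column-wise into tp_size equal shards, one per rank."""
    assert weight.shape[1] % tp_size == 0, "hidden dim must divide evenly by tp_size"
    return np.split(weight, tp_size, axis=1)

# A toy 4x8 weight split across tp_size=4 ranks -> four 4x2 shards.
w = np.arange(32).reshape(4, 8)
shards = shard_columns(w, tp_size=4)
print([s.shape for s in shards])  # -> [(4, 2), (4, 2), (4, 2), (4, 2)]
```

Concatenating the shards back along axis 1 recovers the original matrix, which is why the per-rank engines must all be present at serve time.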
trtllm-build --checkpoint_dir /data/checkpoint_mixtral_tp_4 \
--output_dir /data/build_mixtral_tp_4 \
--gemm_plugin float16 \
  --max_batch_size 10

Set up the model repository:
cp all_models/inflight_batcher_llm/ triton_model_repo -r
cp /data/build_mixtral_tp_4/* triton_model_repo/1
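With tp_size=4, the build step should have produced one serialized engine per rank. A quick sanity check before filling the templates (a sketch; the `rank*.engine` naming is an assumption about trtllm-build's output, so adjust the glob if the filenames differ):

```python
import glob
import os
import tempfile

def count_rank_engines(engine_dir: str) -> int:
    """Count rank-numbered serialized engines in a build directory."""
    return len(glob.glob(os.path.join(engine_dir, "rank*.engine")))

# Demo on a throwaway directory standing in for /data/build_mixtral_tp_4.
with tempfile.TemporaryDirectory() as d:
    for rank in range(4):
        open(os.path.join(d, f"rank{rank}.engine"), "w").close()
    assert count_rank_engines(d) == 4  # one engine per TP rank
```

If the count does not match `--world_size`, the server's ranks will fail to load before it ever binds its ports.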
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:/data/Mixtral-8x7B-Instruct-v0.1,triton_max_batch_size:10,preprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/data/Mixtral-8x7B-Instruct-v0.1,triton_max_batch_size:10,postprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:10,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:10
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:10,decoupled_mode:False,max_beam_width:1,engine_dir:/data/build_mixtral_tp_4,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

Launched the Triton server:
python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo

Expected behavior
Expected to see the server come up, logging:
I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Actual behavior
Couldn't launch the Triton server.
Error:
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
[h100-2:00149] *** Process received signal ***
[h100-2:00149] Signal: Aborted (6)
[h100-2:00149] Signal code: (-6)
what(): boost::interprocess::lock_exception
[h100-2:00149] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f71a871e520]
[h100-2:00149] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f71a87729fc]
[h100-2:00149] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f71a871e476]
[h100-2:00149] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f71a87047f3]
[h100-2:00149] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f71a89a7b9e]
[h100-2:00149] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f71a89b320c]
[h100-2:00149] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f71a89b21e9]
[h100-2:00149] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f71a89b2959]
[h100-2:00149] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f71aa7f6884]
[h100-2:00149] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f71aa7f6f41]
[h100-2:00149] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f71a89b34cb]
[h100-2:00149] [11] /opt/tritonserver/backends/python/libtriton_python.so(+0x87bfa)[0x7f719c12cbfa]
[h100-2:00149] [12] /opt/tritonserver/backends/python/libtriton_python.so(+0x7800c)[0x7f719c11d00c]
[h100-2:00149] [13] /opt/tritonserver/backends/python/libtriton_python.so(+0x7ed06)[0x7f719c123d06]
[h100-2:00149] [14] /opt/tritonserver/backends/python/libtriton_python.so(+0x9930a)[0x7f719c13e30a]
[h100-2:00149] [15] /opt/tritonserver/backends/python/libtriton_python.so(+0x853b3)[0x7f719c12a3b3]
[h100-2:00149] [16] /opt/tritonserver/backends/python/libtriton_python.so(+0x3c4c4)[0x7f719c0e14c4]
[h100-2:00149] [17] /opt/tritonserver/backends/python/libtriton_python.so(TRITONBACKEND_ModelInstanceInitialize+0x4ec)[0x7f719c0e1d0c]
[h100-2:00149] [18] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af096)[0x7f71a9124096]
[h100-2:00149] [19] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02d6)[0x7f71a91252d6]
[h100-2:00149] [20] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928e5)[0x7f71a91078e5]
[h100-2:00149] [21] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f26)[0x7f71a9107f26]
[h100-2:00149] [22] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f81d)[0x7f71a911481d]
[h100-2:00149] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7f71a8775ee8]
[h100-2:00149] [24] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18afee)[0x7f71a90fffee]
[h100-2:00149] [25] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f71a89e1253]
[h100-2:00149] [26] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f71a8770ac3]
[h100-2:00149] [27] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f71a8801a04]
[h100-2:00149] *** End of error message ***
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
terminate called after throwing an instance of 'boost::interprocess::lock_exception'
what(): boost::interprocess::lock_exception
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node h100-2 exited on signal 6 (Aborted).

Additional notes
I also tried building and serving the model with other images, but they didn't work either.
Any help would be much appreciated.