Description
I am trying to bring up a model ensemble with a Baichuan TRT engine.
With a modified tokenizer inside the preprocessing and postprocessing models, tritonserver can load the full pipeline, but processing a client request raises a segmentation fault on the server side. Could you please take a look? Thanks.
- Used the Baichuan2-13B-Chat model from HF.
- Used TensorRT-LLM v0.5.0 (https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/baichuan) to convert the model to a TRT engine:
export DTYPE=bf16
export TP=1
export PP=1
export MAX_BATCH_SIZE=8
export MAX_INPUT_LEN=512
python3 build.py --model_version v2_13b \
--model_dir $MODEL_DIR \
--dtype ${DTYPE} \
--use_gemm_plugin ${DTYPE} \
--use_gpt_attention_plugin ${DTYPE} \
--use_inflight_batching \
--remove_input_padding \
--enable_context_fmha \
--paged_kv_cache \
--max_batch_size ${MAX_BATCH_SIZE} \
--max_input_len ${MAX_INPUT_LEN} \
--max_output_len ${MAX_INPUT_LEN} \
--world_size 1 \
--output_dir $TARGET_DIR
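For context, the Triton model repository follows the standard tensorrtllm_backend ensemble layout (a sketch; the directory names assume the all_models/inflight_batcher_llm template shipped with the backend):

```text
triton_model_repo/
├── preprocessing/    # Python backend; model.py tokenizes the prompt
├── tensorrt_llm/     # TensorRT-LLM backend pointing at the built engine
├── postprocessing/   # Python backend; model.py detokenizes the output ids
└── ensemble/         # config.pbtxt wiring the three models together
```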
- Modified the tokenizer inside the preprocessing and postprocessing models; see model.py. A hedged sketch of the change is shown below.
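A minimal sketch of what the preprocessing model.py change looks like, assuming the standard QUERY/INPUT_ID/REQUEST_INPUT_LEN tensor names from the inflight_batcher_llm preprocessing template and a local Baichuan checkpoint path; this is illustrative, not the exact code from this setup:

```python
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # Assumed checkpoint location inside the container.
        self.tokenizer = AutoTokenizer.from_pretrained(
            "/models/Baichuan2-13B-Chat",
            trust_remote_code=True,  # Baichuan ships a custom tokenizer class
            use_fast=False,
        )
        self.pad_id = self.tokenizer.pad_token_id or self.tokenizer.eos_token_id

    def execute(self, requests):
        responses = []
        for request in requests:
            # QUERY is a TYPE_STRING tensor of shape [batch, 1].
            query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
            # Tokenize each prompt; INPUT_ID must be int32 for the TRT-LLM backend.
            ids = [
                np.array(self.tokenizer.encode(q[0].decode()), dtype=np.int32)
                for q in query
            ]
            lengths = np.array([[len(x)] for x in ids], dtype=np.int32)
            max_len = max(len(x) for x in ids)
            input_ids = np.full((len(ids), max_len), self.pad_id, dtype=np.int32)
            for i, x in enumerate(ids):
                input_ids[i, : len(x)] = x
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[
                    pb_utils.Tensor("INPUT_ID", input_ids),
                    pb_utils.Tensor("REQUEST_INPUT_LEN", lengths),
                ])
            )
        return responses
```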
- Adjusted the package version inside nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3: pip3 install tokenizers==0.13.3.
- Started /opt/tritonserver/bin/tritonserver and sent a curl request from the client side; a hedged example of the launch and request follows.
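Roughly what was run (a sketch; the model-repository path and prompt are assumptions, and the request uses the generate endpoint added in Triton 23.10):

```bash
# Assumed launch command; the --model-repository path is a placeholder.
/opt/tritonserver/bin/tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo

# Assumed client request against the ensemble model.
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
  '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```

The server side got the following error: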
I1105 03:11:25.675011 587 infer_request.cc:117] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I1105 03:11:25.675204 587 infer_request.cc:117] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I1105 03:11:25.675198 587 infer_request.cc:117] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I1105 03:11:25.675244 587 python_be.cc:2321] TRITONBACKEND_ModelInstanceExecute: model instance name preprocessing_0_0 released 1 requests
I1105 03:11:25.675329 587 libtensorrtllm.cc:91] ModelInstanceState::getRequestBooleanInputTensor: user did not not provide stop input for the request
I1105 03:11:25.675380 587 libtensorrtllm.cc:91] ModelInstanceState::getRequestBooleanInputTensor: user did not not provide streaming input for the request
I1105 03:11:25.675408 587 infer_request.cc:117] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I1105 03:11:25.675432 587 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7fd98a0000c0
I1105 03:11:25.675454 587 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7fd98a0000f0
I1105 03:11:25.675476 587 pinned_memory_manager.cc:191] pinned memory deallocation: addr 0x7fd98a000090
Signal (11) received.
0# 0x000055B12500C13D in /opt/tritonserver/bin/tritonserver
1# 0x00007FDBE33A2520 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007FD96287DBB0 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
3# 0x00007FD962849E07 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
4# 0x00007FD962851008 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
5# 0x00007FD962851722 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
6# 0x00007FD96283B241 in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
7# 0x00007FD96283C38A in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so
8# 0x00007FDBE3664253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
9# 0x00007FDBE33F4AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
10# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Segmentation fault (core dumped)
root@smc:/tensorrtllm_backend# I1105 03:11:27.178736 599 pb_stub.cc:1815] Non-graceful termination detected.
I1105 03:11:27.257417 613 pb_stub.cc:1815] Non-graceful termination detected.

I checked the source code in libtensorrtllm.cc, and the two warnings about the unset stop and streaming inputs should be harmless.
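For completeness, those optional tensors can also be supplied explicitly through the KServe v2 infer endpoint to rule them out (a hedged sketch, assuming the standard inflight_batcher_llm input names and datatypes; the token ids and lengths are placeholders):

```bash
curl -X POST localhost:8000/v2/models/tensorrt_llm/infer -d '{
  "inputs": [
    {"name": "input_ids", "datatype": "INT32", "shape": [1, 4], "data": [1, 2, 3, 4]},
    {"name": "input_lengths", "datatype": "INT32", "shape": [1, 1], "data": [4]},
    {"name": "request_output_len", "datatype": "UINT32", "shape": [1, 1], "data": [16]},
    {"name": "stop", "datatype": "BOOL", "shape": [1, 1], "data": [false]},
    {"name": "streaming", "datatype": "BOOL", "shape": [1, 1], "data": [false]}
  ]
}'
```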
Any directions on fixing the segmentation fault issue? Thanks.