triton pytorch backend malloc coredump #4778

Closed
jackzhou121 opened this issue Aug 17, 2022 · 7 comments
Labels: bug (Something isn't working)

Comments

@jackzhou121

Description
I use the Triton SDK to do TorchScript model inference. I run two processes with NVIDIA MPS, and sometimes one of the two processes fails with a coredump. I used gdb to debug the problem; here is the backtrace:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f1df8ef823b in malloc () from /usr/lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f1c817fe000 (LWP 126))]
(gdb) bt
#0 0x00007f1df8ef823b in malloc () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f1df9266b39 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f1df1a21c1c in void std::vector<torch::jit::Value*, std::allocator<torch::jit::Value*> >::emplace_back<torch::jit::Value*>(torch::jit::Value*&&) ()
from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#3 0x00007f1df1acffee in torch::jit::Node::addOutput() () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#4 0x00007f1df1ad76c5 in torch::jit::Block::cloneFrom(torch::jit::Block*, std::function<torch::jit::Value* (torch::jit::Value*)>) () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#5 0x00007f1df1ad7f84 in torch::jit::Graph::copy() () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#6 0x00007f1df19a8724 in torch::jit::GraphFunction::get_executor() () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#7 0x00007f1df19a579e in torch::jit::GraphFunction::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#8 0x00007f1df19a5c5e in torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) ()
from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#9 0x00007f1df19b84bb in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const ()
from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#10 0x00007f1d40126f5d in triton::backend::pytorch::ModelInstanceState::Execute(std::vector<TRITONBACKEND_Response*, std::allocator<TRITONBACKEND_Response*> >, unsigned int, std::vector<c10::IValue, std::allocator<c10::IValue> >, std::vector<at::Tensor, std::allocator<at::Tensor> >) () from /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
#11 0x00007f1d4012d255 in triton::backend::pytorch::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int) ()
from /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
#12 0x00007f1d4012eaa4 in TRITONBACKEND_ModelInstanceExecute () from /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
#13 0x00007f1df80b0faa in nvidia::inferenceserver::TritonModelInstance::Execute(std::vector<TRITONBACKEND_Request*, std::allocator<TRITONBACKEND_Request*> >&) ()
from /opt/tritonserver/lib/libtritonserver.so
#14 0x00007f1df80b1857 in nvidia::inferenceserver::TritonModelInstance::Schedule(std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&, std::function<void ()> const&) () from /opt/tritonserver/lib/libtritonserver.so
#15 0x00007f1df7f5ccc1 in nvidia::inferenceserver::Payload::Execute(bool*) () from /opt/tritonserver/lib/libtritonserver.so
#16 0x00007f1df80ab4f7 in nvidia::inferenceserver::TritonModelInstance::TritonBackendThread::BackendThread(int, int) () from /opt/tritonserver/lib/libtritonserver.so
#17 0x00007f1df9292de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007f1df9d0c609 in start_thread (arg=) at pthread_create.c:477
#19 0x00007f1df8f7d163 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Triton Information
I use the NGC Triton container 21.11 (Triton version 2.16, PyTorch 1.11.0a0+b6df043).
NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.5

Are you using the Triton container or did you build it yourself?
Yes, I use the NGC Triton container 21.11.

To Reproduce
Start NVIDIA MPS and run two processes.
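
For concreteness, a rough launcher sketch of that setup, assuming the two processes are two tritonserver instances sharing the GPUs through MPS (consistent with the libtritonserver.so frames in the backtrace); the /models path and the port numbers are placeholders, not values taken from this issue:

import subprocess
import time

# Start the NVIDIA MPS control daemon (skip this step if a daemon is already running).
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)

# Launch two independent tritonserver processes that will share the GPUs via MPS.
# "/models" and the port numbers below are placeholders.
servers = [
    subprocess.Popen([
        "tritonserver",
        "--model-repository=/models",
        f"--http-port={8000 + 10 * i}",
        f"--grpc-port={8001 + 10 * i}",
        f"--metrics-port={8002 + 10 * i}",
    ])
    for i in range(2)
]

# Crude wait for model load; polling each server's /v2/health/ready endpoint is more robust.
time.sleep(30)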

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
instance_group [
  {
    count: 8
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

parameters: {
  key: "DISABLE_OPTIMIZED_EXECUTION"
  value: {
    string_value: "true"
  }
}

parameters: {
  key: "ENABLE_NVFUSER"
  value: {
    string_value: "true"
  }
}

parameters: {
  key: "INFERENCE_MODE"
  value: {
    string_value: "true"
  }
}

input [
  {
    name: "xwithouttone__0"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [ -1 ]
  },
  {
    name: "tone__1"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [ -1 ]
  },
  {
    name: "prosodyx__2"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [ -1 ]
  },
  {
    name: "emotionid__3"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [ -1 ]
  },
  {
    name: "emotionlevel__4"
    data_type: TYPE_FP32
    format: FORMAT_NONE
    dims: [ -1 ]
  },
  {
    name: "alpha__5"
    data_type: TYPE_FP32
    format: FORMAT_NONE
    dims: [ -1 ]
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP16
    dims: [ 1, -1, 256 ]
  }
]

default_model_filename: "20220804_novel_f25_fastspeech_tensorRT_a30.plan"

optimization {
  input_pinned_memory {
    enable: true
  },
  output_pinned_memory {
    enable: true
  }
}
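
For reference, a minimal client sketch matching the inputs and output declared above; it assumes the tritonclient Python package, a server reachable at localhost:8000, and a model name of "fastspeech" with a sequence length of 16, all of which are assumptions rather than values taken from this issue:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

SEQ_LEN = 16  # arbitrary length for the variable-length [-1] inputs (assumption)
MODEL_NAME = "fastspeech"  # hypothetical model name, not taken from this issue

inputs = []
for name, triton_dtype, np_dtype in [
    ("xwithouttone__0", "INT64", np.int64),
    ("tone__1", "INT64", np.int64),
    ("prosodyx__2", "INT64", np.int64),
    ("emotionid__3", "INT64", np.int64),
    ("emotionlevel__4", "FP32", np.float32),
    ("alpha__5", "FP32", np.float32),
]:
    inp = httpclient.InferInput(name, [SEQ_LEN], triton_dtype)
    inp.set_data_from_numpy(np.zeros(SEQ_LEN, dtype=np_dtype))  # dummy warmup data
    inputs.append(inp)

outputs = [httpclient.InferRequestedOutput("output__0")]
result = client.infer(MODEL_NAME, inputs, outputs=outputs)
print(result.as_numpy("output__0").shape)  # declared output dims are [1, -1, 256]

Sending a few such requests to each server right after startup matches the "two processes warming up models" pattern described later in the thread.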
Expected behavior
A clear and concise description of what you expected to happen.

@jackzhou121
Author

platform: "pytorch_libtorch"

@dyastremsky
Contributor

Thanks for providing the config and backtrace. Can you run this model in the latest container (22.07) and report the results?

@dyastremsky added the bug label on Aug 17, 2022
@jackzhou121
Author

The problem is gone when running the model in the latest container (22.07).

@jackzhou121
Author

Why does Triton container 21.11 sometimes crash when I start two processes and warm up the models?

@dyastremsky
Contributor

I'd need the backtrace and run commands to start looking into why, and possibly the model to see if I can reproduce the bug.

Is it happening in the latest container? It's possible there was a concurrency bug that's been fixed in the last 9 months/versions.

@jackzhou121
Author

jackzhou121 commented Aug 29, 2022 via email

@dyastremsky
Contributor

Great. Do you still need this looked into? The backtrace can help you locate where the error is happening so you can debug it. If it's already been fixed, it may not make sense to look into it further (we only patch future releases).
