triton pytorch backend malloc coredump #4778
platform: "pytorch_libtorch"
Thanks for providing the config and backtrace. Can you run this model in the latest container (22.07) and report the results?
The problem is gone when running the model in the latest container (22.07).
Why did Triton container 21.11 sometimes crash when I started two processes and warmed up the models?
I'd need the backtrace and run commands to start looking to see why. Possibly the model to see if I can reproduce the bug. Is it happening in the latest container? It's possible there was a concurrency bug that's been fixed in the last 9 months/versions.
The latest container r22.07 is ok.
Great. Do you still need this looked into? The backtrace can help you locate where the error is happening to debug. If it's already been fixed, it may not make sense to look into (we only patch future releases).
Description
I use the Triton SDK to run TorchScript model inference. I run two processes with NVIDIA MPS, and sometimes one of the two processes fails with a core dump. I used gdb to debug the problem; here is the backtrace:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f1df8ef823b in malloc () from /usr/lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f1c817fe000 (LWP 126))]
(gdb) bt
#0 0x00007f1df8ef823b in malloc () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f1df9266b39 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f1df1a21c1c in void std::vector<torch::jit::Value*, std::allocator<torch::jit::Value*> >::emplace_back<torch::jit::Value*>(torch::jit::Value*&&) ()
from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#3 0x00007f1df1acffee in torch::jit::Node::addOutput() () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#4 0x00007f1df1ad76c5 in torch::jit::Block::cloneFrom(torch::jit::Block*, std::function<torch::jit::Value* (torch::jit::Value*)>) () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#5 0x00007f1df1ad7f84 in torch::jit::Graph::copy() () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#6 0x00007f1df19a8724 in torch::jit::GraphFunction::get_executor() () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#7 0x00007f1df19a579e in torch::jit::GraphFunction::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#8 0x00007f1df19a5c5e in torch::jit::GraphFunction::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) ()
from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#9 0x00007f1df19b84bb in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const ()
from /opt/tritonserver/backends/pytorch/libtorch_cpu.so
#10 0x00007f1d40126f5d in triton::backend::pytorch::ModelInstanceState::Execute(std::vector<TRITONBACKEND_Response*, std::allocator<TRITONBACKEND_Response*> >*, unsigned int, std::vector<c10::IValue, std::allocator<c10::IValue> >*, std::vector<at::Tensor, std::allocator<at::Tensor> >*) () from /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
#11 0x00007f1d4012d255 in triton::backend::pytorch::ModelInstanceState::ProcessRequests(TRITONBACKEND_Request**, unsigned int) ()
from /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
#12 0x00007f1d4012eaa4 in TRITONBACKEND_ModelInstanceExecute () from /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
#13 0x00007f1df80b0faa in nvidia::inferenceserver::TritonModelInstance::Execute(std::vector<TRITONBACKEND_Request*, std::allocator<TRITONBACKEND_Request*> >&) ()
from /opt/tritonserver/lib/libtritonserver.so
#14 0x00007f1df80b1857 in nvidia::inferenceserver::TritonModelInstance::Schedule(std::vector<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> >, std::allocator<std::unique_ptr<nvidia::inferenceserver::InferenceRequest, std::default_delete<nvidia::inferenceserver::InferenceRequest> > > >&&, std::function<void ()> const&) () from /opt/tritonserver/lib/libtritonserver.so
#15 0x00007f1df7f5ccc1 in nvidia::inferenceserver::Payload::Execute(bool*) () from /opt/tritonserver/lib/libtritonserver.so
#16 0x00007f1df80ab4f7 in nvidia::inferenceserver::TritonModelInstance::TritonBackendThread::BackendThread(int, int) () from /opt/tritonserver/lib/libtritonserver.so
#17 0x00007f1df9292de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007f1df9d0c609 in start_thread (arg=) at pthread_create.c:477
#19 0x00007f1df8f7d163 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6
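Frames 4 through 6 show torch::jit::GraphFunction::get_executor() cloning the TorchScript graph on first execution; a crash inside malloc() during that clone is consistent with two warmup threads racing through a one-time lazy initialization and corrupting the heap. As an illustration only (this is a Python analogy, not libtorch code; LazyExecutor and its members are invented names), the double-checked locking pattern below shows the kind of guard such a lazy init needs:

```python
import threading

class LazyExecutor:
    """Analogy for a lazily built graph executor shared across request threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._executor = None
        self.init_count = 0  # instrumentation: how many times init actually ran

    def get_executor(self):
        # Double-checked locking: the cheap unlocked check avoids contention
        # on the hot path, while the locked re-check guarantees the expensive
        # one-time initialization runs exactly once even under a thundering
        # herd of concurrent warmup requests.
        if self._executor is None:
            with self._lock:
                if self._executor is None:
                    self.init_count += 1
                    self._executor = object()  # stands in for the cloned graph
        return self._executor

shared = LazyExecutor()
threads = [threading.Thread(target=shared.get_executor) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert shared.init_count == 1  # initialized exactly once despite 16 racers
```

Without the lock, two threads can both observe `None` and both mutate the shared structure, which in C++ (unlike this GIL-protected sketch) can corrupt allocator metadata and fault later inside malloc().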
Triton Information
I use NGC Triton container 21.11 (Triton version 2.16, PyTorch 1.11.0a0+b6df043).
NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.5
Are you using the Triton container or did you build it yourself?
Yes, I use the NGC Triton container 21.11.
To Reproduce
Start NVIDIA MPS and run two processes.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
instance_group [
{
count: 8
kind: KIND_GPU
gpus:[0,1]
}
]
parameters: {
key: "DISABLE_OPTIMIZED_EXECUTION"
value: {
string_value:"true"
}
}
parameters: {
key: "ENABLE_NVFUSER"
value: {
string_value: "true"
}
}
parameters: {
key: "INFERENCE_MODE"
value: {
string_value: "true"
}
}
input [
{
name: "xwithouttone__0"
data_type: TYPE_INT64
format: FORMAT_NONE
dims: [-1]
},
{
name: "tone__1"
data_type: TYPE_INT64
format: FORMAT_NONE
dims: [-1]
},
{
name: "prosodyx__2"
data_type: TYPE_INT64
format: FORMAT_NONE
dims: [-1]
},
{
name: "emotionid__3"
data_type: TYPE_INT64
format: FORMAT_NONE
dims: [-1]
},
{
name: "emotionlevel__4"
data_type: TYPE_FP32
format: FORMAT_NONE
dims: [-1]
},
{
name: "alpha__5"
data_type: TYPE_FP32
format: FORMAT_NONE
dims: [-1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [1,-1,256]
}
]
default_model_filename: "20220804_novel_f25_fastspeech_tensorRT_a30.plan"
optimization {
input_pinned_memory {
enable: true
},
output_pinned_memory {
enable: true
}
}
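For reference, the config above declares six variable-length 1-D inputs (four TYPE_INT64, two TYPE_FP32). The sketch below, using only the Python standard library, shows the little-endian byte layout those dtypes imply; the tensor values are made up, and a real client would use the tritonclient package rather than hand-serializing with struct:

```python
import struct

def pack_int64(values):
    # Serialize a 1-D TYPE_INT64 tensor (dims: [-1]) as little-endian int64s.
    return struct.pack(f"<{len(values)}q", *values)

def pack_fp32(values):
    # Serialize a 1-D TYPE_FP32 tensor (dims: [-1]) as little-endian float32s.
    return struct.pack(f"<{len(values)}f", *values)

# Illustrative values only; names match the config's input declarations.
inputs = {
    "xwithouttone__0": pack_int64([101, 102, 103]),
    "tone__1":         pack_int64([1, 2, 3]),
    "prosodyx__2":     pack_int64([0, 0, 1]),
    "emotionid__3":    pack_int64([4]),
    "emotionlevel__4": pack_fp32([0.8]),
    "alpha__5":        pack_fp32([1.0]),
}

# Each int64 element occupies 8 bytes, each fp32 element 4 bytes.
assert len(inputs["xwithouttone__0"]) == 3 * 8
assert len(inputs["alpha__5"]) == 1 * 4
```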
Expected behavior
The server should serve inference from both processes without crashing (no segmentation fault).