Triton Inference Server adding ~3 seconds to YOLOv4 inference #4079
Comments
The error seemed to come from our client script. We rewrote it and inference times normalized.
Reopening this since the problem is occurring again, this time with the perf_client as well. I made no modifications to the perf_client whatsoever, and I am using the same config setup as detailed here (with a different model name in the config to differentiate the ensemble version from the standalone version with no preprocessing); an example invocation is sketched below.
Here is a full verbose log from the run (with only 1 inference, as I believe adding any more would simply be unnecessary):
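For reference, a typical invocation of the tool (shipped as perf_client in older Triton releases and perf_analyzer in newer ones) looks roughly like this; the model name and server URL are placeholders, not the exact values from this setup:

```sh
# Hedged sketch of a perf_analyzer / perf_client run over gRPC.
# Model name and host are placeholders.
perf_analyzer -m yolov4_trt -u <server-host>:8001 -i grpc --concurrency-range 1:1
```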
We ran the client and the inference server on the same machine and inference was returned in ~50 ms, which was the expected behaviour. However, running the same client script on a different machine still results in a 3 second delay. We are not using the shared memory option when launching the inference server or in the client, and everything is done over gRPC.
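For context, a minimal sketch of what such a remote gRPC call typically looks like with the tritonclient Python package; the model name, tensor names, and input shape here are assumptions, not the exact values used in this issue:

```python
# Minimal remote gRPC inference sketch with tritonclient.
# Model name, tensor names, and shape are placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="<server-host>:8001")

image = np.random.rand(1, 3, 608, 608).astype(np.float32)  # placeholder input tensor
inputs = [grpcclient.InferInput("input", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [grpcclient.InferRequestedOutput("detections")]

result = client.infer(model_name="yolov4_trt", inputs=inputs, outputs=outputs)
print(result.as_numpy("detections").shape)
```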
Following up on the network latency theory from above: we tested on 3 different systems, all running Ubuntu 20.04. The Triton Inference Server is located on a GCloud VM with a Tesla T4, in us-west1-b. All tests were run using the same server instance/container and the same client script/image. When we tested on the same VM that was running the Triton Inference Server, it only took ~50 ms to get an inference back. All of the networks being used were at least 1 GB/s. So our theory is that Triton has to wait for the entire image to reach the inference server before starting inference, and that transfer takes a while, causing Triton to wait and inflating the inference round-trip time. (A rough transfer-time estimate is sketched below.)
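A back-of-envelope check of the transfer-time theory, assuming a 608x608 FP32 input tensor and a conservative 1 Gbit/s link; it only accounts for raw bytes on the wire, not WAN round-trip latency, proxies, or TCP ramp-up:

```python
# Rough estimate of raw transfer time (assumptions: 1x3x608x608 FP32 input, 1 Gbit/s link).
# Ignores WAN round trips, TLS/proxy overhead, and TCP slow start, which can dominate in practice.
payload_bytes = 1 * 3 * 608 * 608 * 4          # ~4.4 MB for one FP32 image tensor
link_bits_per_s = 1e9                           # 1 Gbit/s
transfer_s = payload_bytes * 8 / link_bits_per_s
print(f"{payload_bytes / 1e6:.1f} MB -> ~{transfer_s * 1000:.0f} ms on the wire")
```

On those assumptions the payload alone accounts for only ~35 ms, so raw bandwidth by itself would not explain a ~3 second delay.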
We switched to using the HTTP client, since we're more familiar with HTTP than gRPC. We are also using NGINX in front of the VM. It is unlikely this has anything to do with the increased inference time, but I wanted to mention it so potential investigators have a better understanding of the setup.
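For completeness, a minimal sketch of the equivalent HTTP call with tritonclient, again with placeholder model/tensor names and shape; if NGINX proxies Triton's HTTP port, the URL would point at the proxy instead:

```python
# Minimal HTTP inference sketch with tritonclient.
# Model name, tensor names, and shape are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<server-host>:8000")

image = np.random.rand(1, 3, 608, 608).astype(np.float32)   # placeholder input tensor
inputs = [httpclient.InferInput("input", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image, binary_data=True)      # binary avoids JSON-encoding the tensor
outputs = [httpclient.InferRequestedOutput("detections", binary_data=True)]

result = client.infer("yolov4_trt", inputs, outputs=outputs)
print(result.as_numpy("detections").shape)
```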
Here is a more simplified output of a request's lifetime, split into three parts:

[verbose-log excerpt trimmed; part #1 begins at 02:47:30.059975 with "Triton Client actually sending now", followed by Part #2 and Part #3]
This particular request took ~1600 ms, so these 3 big chunks take up ~95% of the inference round-trip time. I'll try to find out whether the client is just sitting on the request and response, or whether the server is taking longer than usual to process them.
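One way to narrow this down is to time the call on the client and compare it with Triton's own per-model statistics; a hedged sketch follows, with the model name, tensor name, and shape as placeholders:

```python
# Compare client-side round-trip time with Triton's server-side statistics.
# Names and shapes are placeholders, not the values from this issue.
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="<server-host>:8001")
image = np.random.rand(1, 3, 608, 608).astype(np.float32)
inputs = [grpcclient.InferInput("input", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)

t0 = time.perf_counter()
client.infer("yolov4_trt", inputs)
round_trip_ms = (time.perf_counter() - t0) * 1000

# Server-side view: cumulative queue/compute/input/output durations per model.
stats = client.get_inference_statistics(model_name="yolov4_trt", as_json=True)
print(f"client round trip: {round_trip_ms:.0f} ms")
print(stats)
```

If the server-reported compute time stays at ~10-15 ms while the client round trip is ~1600 ms, the gap is being spent outside the model execution (serialization, transfer, proxying, or client-side handling).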
Why do you think this is anything more than networking overhead?
I no longer do. I thought it was something more, since one of the chunks of time that was causing problems was within the actual Triton inference process, but we have since resolved that issue. I'm wondering if there is a way to reduce the networking overhead, but so far I haven't found anything particularly fruitful.
Closing due to this being unrelated to Triton.
Hello,
I just recently set up Triton to work with our YOLOv4/TensorRT model. However, I noticed that inference is taking a long time, ~3 seconds. This didn't seem right, so I fetched the verbose logs and got this back:
So it seems that the actual inference only takes ~10-15 milliseconds, which is normal. However, before that, the logs show:
I tried searching the docs, Google, and other issues to figure out what exactly is going on and why this takes 3 seconds, but unfortunately didn't find much. At first I thought it was because of model state, but according to the docs, CNNs like YOLO should be stateless. I'm unsure how to resolve this.
This is the config.pbtxt
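(The original config.pbtxt isn't reproduced above; for orientation only, a TensorRT-plan config for a YOLOv4-style model generally looks roughly like the sketch below, where the tensor names, dims, and max_batch_size are assumptions rather than the poster's actual values.)

```
# Illustrative config.pbtxt sketch for a TensorRT YOLOv4-style model.
# Tensor names, dims, and max_batch_size are assumptions, not the issue's config.
name: "yolov4_trt"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 608, 608 ]
  }
]
output [
  {
    name: "detections"
    data_type: TYPE_FP32
    dims: [ -1, 7 ]
  }
]
instance_group [ { kind: KIND_GPU, count: 1 } ]
```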
and this is how the docker container running Triton is deployed
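(The exact launch command isn't shown above either; a typical Triton container launch looks roughly like this, with the image tag, ports, and model repository path as placeholders.)

```sh
# Illustrative Triton launch; image tag, ports, and paths are placeholders,
# not the exact command used in this issue.
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models --log-verbose=1
```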