Inference results differ for different batch sizes #5640
Comments
Triton never alters the output based on batch size. Triton may group inputs into batches of different sizes, but the batched input is ultimately passed to the model unchanged, and the model produces the output. I think this might be a difference in the model's behavior across batch sizes. CC @oandreeva-nv in case you have some input on this.
I tend to agree that the underlying computations on different batch sizes may cause discrepancies. CC @rmccorm4, if I remember correctly, you were exploring similar issues at some point?
Hi @casperroo, yes, it is generally expected that different batch sizes, when executed on a GPU, can show slight variance in their results. This is generally due to different CUDA kernels being selected based on the batch size. Some frameworks provide APIs to make execution more deterministic; there are relevant docs about that for PyTorch, but the same concepts generally apply to TensorFlow and other frameworks: https://github.com/triton-inference-server/python_backend/tree/main#determinism-and-reproducibility
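For reference, a minimal sketch of the kind of determinism toggles those PyTorch notes describe (which flags are actually needed depends on the model and the CUDA/cuDNN versions in use):

```python
import torch

# Ask PyTorch to pick deterministic implementations where they exist;
# ops without one will raise an error instead of silently varying.
torch.use_deterministic_algorithms(True)

# Disable cuDNN autotuning, which can otherwise select different
# kernels (and therefore different reduction orders) per input shape.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Some CUDA ops additionally require the environment variable
# CUBLAS_WORKSPACE_CONFIG=:4096:8 to be set before the process starts.
```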
Thank you guys very much, you have pointed me in the right direction. "Determinism" is the keyword I had been missing. https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism The following bit explains it very well: "These differences are often caused by the use of asynchronous threads within the op nondeterministically changing the order in which floating-point numbers are added. Most of these cases of nondeterminism occur on GPUs, which have thousands of hardware threads that are used to run ops." So it is actually obvious once you read it: the order in which floating-point operations are executed makes a difference.
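To make that last point concrete, here is a tiny plain-Python illustration (no GPU needed): floating-point addition is not associative, so changing the order of a reduction changes the result. GPU kernels selected for different batch sizes reduce in different orders, which is where the small score differences come from.

```python
x = 1e16

# Adding 1.0 to 1e16 first: 1.0 is smaller than the gap between
# adjacent doubles near 1e16, so it is rounded away entirely.
print((x + 1.0) - x)   # 0.0

# Doing the subtraction first keeps the 1.0.
print((x - x) + 1.0)   # 1.0
```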
Software version:
In short:
Is it expected behavior of the Triton Inference Server (or the underlying backend) to yield different results for different batch sizes on the very same input data?
I am testing image classification with efficientnet.
The differences are minimal, but I wonder whether this is expected or whether some batched data gets overlapped somewhere in the pipeline.
My test:
I start the triton inference server and serve a model (efficientnet) with the following config:
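(The original config was not preserved in this copy of the issue. Purely for illustration, a config.pbtxt for an EfficientNet TensorFlow SavedModel with dynamic batching might look roughly like this; the model name, tensor names, and shapes below are assumptions, not the actual values from the issue.)

```
name: "efficientnet"
platform: "tensorflow_savedmodel"
max_batch_size: 32
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching { }
```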
I run the client with batch size of 1:
With batch size of 2:
With batch size of 10:
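(The actual client commands and their outputs were not preserved here. As an illustration of the comparison being described, a test like this could be scripted with the Python tritonclient roughly as follows; the model name, tensor names, and the load_and_preprocess helper are assumptions.)

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def classify(images):
    # images: float32 array of shape [batch, 224, 224, 3]
    inp = httpclient.InferInput("input_1", list(images.shape), "FP32")
    inp.set_data_from_numpy(images)
    out = httpclient.InferRequestedOutput("predictions")
    result = client.infer("efficientnet", inputs=[inp], outputs=[out])
    return result.as_numpy("predictions")

img = load_and_preprocess("1.png")  # hypothetical preprocessing helper

# The same image scored alone and inside larger batches; with
# nondeterministic GPU kernels the scores can differ slightly.
print(classify(img[None, ...])[0])
print(classify(np.repeat(img[None, ...], 2, axis=0))[0])
print(classify(np.repeat(img[None, ...], 10, axis=0))[0])
```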
So, depending on the batch size, 1.png yields the following scores for l12:
I get the same results when I run it from my client with CUDA shared memory.
Disabling the gpu_execution_accelerators doesn't change the behavior.
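(For context, gpu_execution_accelerators refers to the optimization block in config.pbtxt; the setting being toggled presumably looked something like the following, shown here with TensorRT as an example accelerator.)

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      { name : "tensorrt" }
    ]
  }
}
```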