Inference time jumps for varying batch size #53725
Labels: 2.6.0, comp:gpu, stat:awaiting tensorflower, type:performance
System information
Current behavior
When running classifier inference on GPU, if an input batch size is seen for the first time, the inference time is higher than expected. In subsequent runs with the same batch size, the inference time drops. When the inference time jump occurs, the load shifts to the CPU (GPU usage drops in nvidia-smi), while on subsequent inferences the load stays on the GPU.
Example 1:
For a random batch size, the inference time in run 2 is lower because that batch size was already seen in run 1. In run 3, all batch sizes were seen in run 2, so their inference times are lower.

Example 2:
For 10 random batch sizes, the inference time in run 2 is lower because all of these batch sizes were seen in run 1. In run 3, all batch sizes were seen in run 2, so their inference times are lower.

Why is this relevant?
Suppose we have a video sequence where the number of objects varies every few frames (i.e., the batch size equals the number of objects). Every time a frame contains a total number of objects that has not been seen before, there is a jump in inference time. For example, if there are 10 objects in the first 30 frames and then 8 objects in the next frame, an inference time jump is observed for that frame. In a product with real-time expectations, this can have system-level implications for other modules.
Expected behavior
Inference time jumps should not be observed with varying batch sizes, as they can have system-level implications. This behavior is not observed with PyTorch.
An easy solution is to run classifier inference on dummy images for all batch sizes during initialization. For example, we can run inference for batch sizes 1 to 64 if the maximum expected number of objects is 64 (see the sketch below). However, that's more of a hack. I am interested in understanding the reason behind this issue. It looks like it has something to do with memory allocation, but why is it dependent on batch size? Is there a better TensorFlow configuration, inference function, or memory allocation setting that can resolve it?
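A rough sketch of that warm-up hack, assuming a Keras classifier `model` and a 224x224x3 input shape (both placeholders for whatever the actual model expects):

```python
import tensorflow as tf

MAX_BATCH_SIZE = 64  # assumed upper bound on objects per frame

def warm_up(model, max_batch_size=MAX_BATCH_SIZE, input_shape=(224, 224, 3)):
    """Run one dummy inference per batch size so later calls with that
    batch size hit already-warmed code paths and allocations."""
    for batch_size in range(1, max_batch_size + 1):
        dummy = tf.zeros((batch_size, *input_shape), dtype=tf.float32)
        _ = model(dummy, training=False)
```

This works in practice but scales linearly with the number of distinct batch sizes, which is why I'd prefer a proper configuration-level fix.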
Standalone code to reproduce the issue
Jupyter notebook in Colab:
https://colab.research.google.com/drive/1fHy3HcrYBskMLy-nNn12bbwBI-QIxg1c?usp=sharing
This notebook uses model() for inference. Using model.predict() gives similar results.
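For reference (since the details are in the Colab link), a minimal sketch of the kind of timing loop involved; the choice of MobileNetV2, the 1-64 batch size range, and the three runs are assumptions for illustration, not the exact notebook contents:

```python
import time
import numpy as np
import tensorflow as tf

# Any Keras classifier reproduces the effect; MobileNetV2 is just an example.
model = tf.keras.applications.MobileNetV2(weights=None)
input_shape = model.input_shape[1:]  # (224, 224, 3)

rng = np.random.default_rng(0)
batch_sizes = rng.integers(1, 65, size=10)  # 10 random batch sizes

for run in range(1, 4):
    print(f"run {run}")
    for bs in batch_sizes:
        x = tf.random.uniform((int(bs), *input_shape))
        start = time.perf_counter()
        _ = model(x, training=False)
        print(f"  batch size {bs}: {time.perf_counter() - start:.4f}s")
```

In run 1, each batch size is new and slow; in runs 2 and 3, the same batch sizes are fast.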
To enable the GPU in Colab: Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU.