⚡️ Speed up function rescale_detections by 29% in PR #1250 (feature/inference-v1-models)
#1330
⚡️ This pull request contains optimizations for PR #1250
If you approve this dependent PR, these changes will be merged into the original PR branch `feature/inference-v1-models`.

📄 29% (0.29x) speedup for `rescale_detections` in `inference/v1/models/common/post_processing.py`

⏱️ Runtime: 24.8 milliseconds → 19.2 milliseconds (best of 95 runs)

📝 Explanation and details
Here’s an optimized version of your program.

The line profiling shows that regenerating the 1D tensors (`offsets`, `scale`) and the sliced in-place ops are the major time consumers, and that running every per-image operation inside a tight loop adds overhead.

Key ideas:

- Replace the per-image `for` loop with vectorization where possible.
- If your images always have the same metadata, full vectorization is possible.
- If metadata varies per image (as the profiling suggests), batch-vectorizing the first four columns in a single kernel gives the main speed gain.
Below is a much faster version.
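The optimized diff itself isn't reproduced in this summary, so the following is only a sketch of the pattern described above: offset and scale tensors are built once per batch, and the first four box columns are rescaled in one vectorized expression. The signature, the `(x1, y1, x2, y2)` column layout, and the `pad_offsets`/`scale_factors` arguments are illustrative assumptions, not the real `post_processing.py` API.

```python
# Sketch of the described optimization; the signature, column layout, and
# metadata field names below are assumptions, not the actual
# inference/v1/models/common/post_processing.py code.
from typing import List, Sequence, Tuple

import torch


def rescale_detections_sketch(
    detections: List[torch.Tensor],              # one (N_i, >=4) tensor per image
    pad_offsets: Sequence[Tuple[float, float]],  # hypothetical per-image (pad_x, pad_y)
    scale_factors: Sequence[float],              # hypothetical per-image scale
) -> List[torch.Tensor]:
    if not detections:
        return detections

    device, dtype = detections[0].device, detections[0].dtype

    # Build offset/scale tensors once per batch instead of re-creating
    # small 1D tensors inside the per-image loop.
    offsets = torch.tensor(
        [(px, py, px, py) for px, py in pad_offsets], dtype=dtype, device=device
    )  # (B, 4)
    scales = torch.tensor(scale_factors, dtype=dtype, device=device).view(-1, 1)

    rescaled = []
    for dets, offset, scale in zip(detections, offsets, scales):
        out = dets.clone()
        # Update the (x1, y1, x2, y2) columns in one vectorized expression
        # instead of column-by-column in-place slices.
        out[:, :4] = (out[:, :4] - offset) / scale
        rescaled.append(out)
    return rescaled
```

Keeping the outer loop but moving tensor construction out of it preserves support for ragged per-image detection counts while removing the per-iteration allocation cost.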
What Changed and Why

- No more `torch.as_tensor` inside the tightest loop: instead, `offsets` and `scales` for all images are created once per batch.
- The per-image path (`rescale_image_detections`) still works, but is consolidated around pre-built tensors (`torch.tensor`, not `as_tensor`) for better perf; this skips small hidden overheads.
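A minimal, self-contained illustration of that allocation change (the `(pad_x, pad_y)` metadata values here are made up):

```python
# Illustrative only -- the (pad_x, pad_y) metadata below is made up.
import torch

metadata = [(4.0, 8.0), (0.0, 2.0), (6.0, 6.0)]

# Before: a small tensor is re-created for every image inside the tight loop.
per_image_offsets = [torch.as_tensor([px, py, px, py]) for px, py in metadata]

# After: one (B, 4) tensor built up front with torch.tensor, so the loop body
# no longer pays per-image tensor-construction overhead.
batch_offsets = torch.tensor([[px, py, px, py] for px, py in metadata])

assert all(torch.equal(row, t) for row, t in zip(batch_offsets, per_image_offsets))
```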
Further potential batch speedups

If you can pad all detection tensors to the same shape, you can batch-process the entire `detections` list using broadcasting for further speed (a sketch of this idea follows below). For now, this version assumes the detection tensors may have different lengths, which matches your usage pattern.
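A hedged sketch of that padding idea, using `torch.nn.utils.rnn.pad_sequence` to equalize lengths; the column layout, the helper name, and the example data are assumptions:

```python
# Sketch only -- assumes each detection row starts with (x1, y1, x2, y2)
# and that per-image offsets (B, 4) and scales (B, 1) are already built.
import torch


def rescale_padded(detections, offsets, scales):
    lengths = [d.shape[0] for d in detections]
    # Pad the ragged per-image tensors to (B, N_max, C) so a single
    # broadcasted kernel rescales every image at once.
    padded = torch.nn.utils.rnn.pad_sequence(detections, batch_first=True)
    padded[..., :4] = (padded[..., :4] - offsets[:, None, :]) / scales[:, None, :]
    # Split back into the original ragged list.
    return [padded[i, :n] for i, n in enumerate(lengths)]


dets = [torch.rand(3, 6), torch.rand(5, 6)]                 # hypothetical ragged detections
offs = torch.tensor([[4.0, 8.0, 4.0, 8.0], [0.0, 2.0, 0.0, 2.0]])  # (B, 4)
scls = torch.tensor([[2.0], [1.5]])                          # (B, 1)
out = rescale_padded(dets, offs, scls)
```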
✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes, `git checkout codeflash/optimize-pr1250-2025-06-05T15.39.56` and push.