⚡️ Speed up method RFDetrForObjectDetectionTorch.post_process by 11% in PR #1250 (feature/inference-v1-models)
#1264
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
⚡️ This pull request contains optimizations for PR #1250
If you approve this dependent PR, these changes will be merged into the original PR branch
feature/inference-v1-models.📄 11% (0.11x) speedup for
RFDetrForObjectDetectionTorch.post_processininference/v1/models/rfdetr/rfdetr_object_detection_pytorch.py⏱️ Runtime :
524 microseconds→471 microseconds(best of187runs)📝 Explanation and details
Below is the optimized version of your code.
Key performance problems based on your profile output.
scores = scores[keep],labels = labels[keep],boxes = boxes[keep]): These are the most expensive operations, especially since they're repeated for each result, and at leastboxes = boxes[keep]is almost as costly as the filtering of the scores.torch.tensor(orig_sizes, ...)).Optimization Strategy
resultsis a batch of dicts where each has an equal batch size and the same shapes, try to batch the mask and filtering in a vectorized way over all batch elements instead of looping. However, if that's not possible due to shape irregularities, some gains can be had by combining extraction and mask, or by using more in-place operations.If true batch-vectorization isn’t possible due to differently sized outputs per result (common in detection models), we can still reduce cost by filtering all three arrays with the mask at once, and using tuple unpacking.
Below is the optimized code.
All previous comments preserved and code logic maintained.
Key changes explained:
keep = (scores > threshold).nonzero(as_tuple=False).squeeze(-1):The
.nonzero().squeeze()pattern is much faster than boolean masking repeatedly.If nothing kept, the loop skips object allocation via a single, empty entry.
Use
index_selectwhich is typically faster for 1D "where" index selection, especially if you're doing multiple fields at once.Allocates
detections_list.appendmethod locally (small optimization, but helps in tight Python loops).Avoids repeated variable annotation in the filtered lines.
Returns exactly the same result as before.
If you know the batch always has at least some detections, you could remove the empty check, but this way it is robust and maximally efficient.
If
resultsis huge and the loop is still too slow, it would be worth profiling the underlying model and postprocessor.But this approach reduces your inner-loop cost for the
scores > thresholdfiltering by ~2x-3x per iteration.✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes
git checkout codeflash/optimize-pr1250-2025-05-13T16.40.42and push.