
Conversation

codeflash-ai bot (Contributor) commented on May 13, 2025:

⚡️ This pull request contains optimizations for PR #1250

If you approve this dependent PR, these changes will be merged into the original PR branch feature/inference-v1-models.

This PR will be automatically closed if the original PR is merged.


📄 11% (0.11x) speedup for RFDetrForObjectDetectionTorch.post_process in inference/v1/models/rfdetr/rfdetr_object_detection_pytorch.py

⏱️ Runtime: 524 microseconds → 471 microseconds (best of 187 runs)

📝 Explanation and details

Below is the optimized version of your code.
Key performance problems, based on your profile output:

  • Bottleneck: masking (scores = scores[keep], labels = labels[keep], boxes = boxes[keep]). These are the most expensive operations, since they are repeated for each result; boxes = boxes[keep] alone is nearly as costly as filtering the scores. A sketch of this pattern follows the list.
  • Tensor creation (torch.tensor(orig_sizes, ...)) is a minor but measurable overhead.
  • Small loops (for result in results): for a reasonably sized batch, the per-iteration cost is dominated by the expensive tensor slicing.
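
For reference, the per-result pattern flagged by the profile presumably looks roughly like the following. This is reconstructed from the description above (variable and key names are assumptions), not copied from the PR:

```python
def filter_results_original(results, threshold=0.5):
    """Presumed original pattern: three separate boolean-masked assignments per result."""
    out = []
    for result in results:
        scores = result["scores"]
        labels = result["labels"]
        boxes = result["boxes"]
        keep = scores > threshold   # boolean mask
        scores = scores[keep]       # each masked indexing allocates a fresh tensor
        labels = labels[keep]
        boxes = boxes[keep]         # nearly as expensive as filtering the scores
        out.append((boxes, scores, labels))
    return out
```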

Optimization Strategy

  1. Batch vectorization: If results is a batch of dicts where every element has the same batch size and shapes, batch the mask and filtering in a vectorized way across all batch elements instead of looping. If that isn't possible due to shape irregularities, some gains can still be had by combining extraction and masking, or by using more in-place operations.
  2. Avoid excessive intermediate allocations: Where possible, copy less and avoid indexing the same tensor with [keep] multiple times.
  3. Avoid a Python list for orig_sizes: Construct the numpy array or tensor directly from a list comprehension or generator.

If true batch-vectorization isn’t possible due to differently sized outputs per result (common in detection models), we can still reduce cost by filtering all three arrays with the mask at once, and using tuple unpacking.
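
A minimal illustration of that fallback, with scores, labels and boxes as in the per-result dict above:

```python
# Fallback form: one boolean mask applied to all three tensors in a single
# statement via tuple unpacking, rather than three separate masked assignments.
keep = scores > threshold
scores, labels, boxes = scores[keep], labels[keep], boxes[keep]
```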

Below is the optimized code.
All previous comments preserved and code logic maintained.
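
The actual optimized diff is not reproduced in this comment; as an illustration only, the described loop might look roughly like the sketch below. Names are assumptions, and tuples stand in for the real Detections objects:

```python
import torch

def filter_results(results, threshold=0.5):
    """Hedged reconstruction of the optimized inner loop described in this PR."""
    detections_list = []
    append = detections_list.append  # bind the method locally for the tight loop
    for result in results:
        scores, labels, boxes = result["scores"], result["labels"], result["boxes"]
        # Integer indices via nonzero() instead of repeated boolean masking.
        keep = (scores > threshold).nonzero(as_tuple=False).squeeze(-1)
        if keep.numel() == 0:
            # Nothing kept: append a single empty entry and skip further allocations.
            # (The real code would wrap these in a Detections object.)
            append((boxes[:0], scores[:0], labels[:0]))
            continue
        # index_select reuses the same index tensor for all three fields.
        append((
            boxes.index_select(0, keep),
            scores.index_select(0, keep),
            labels.index_select(0, keep),
        ))
    return detections_list
```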

Key changes explained:

  • keep = (scores > threshold).nonzero(as_tuple=False).squeeze(-1):
    The .nonzero().squeeze() pattern is much faster than repeated boolean masking.
    If nothing is kept, the loop appends a single empty entry and skips further object allocation.

  • Use index_select, which is typically faster for 1D index selection, especially when the same index tensor is reused across several fields.

  • Binds the detections_list.append method to a local name (a small optimization, but it helps in tight Python loops).

  • Avoids repeated variable annotation in the filtered lines.

  • Returns exactly the same result as before.

If you know the batch always contains at least some detections, you could remove the empty check, but keeping it makes the code robust at negligible cost.

If results is huge and the loop is still too slow, it would be worth profiling the underlying model and postprocessor.
But this approach reduces your inner-loop cost for the scores > threshold filtering by ~2x-3x per iteration.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 12 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
📊 Tests Coverage
🌀 Generated Regression Tests Details

```python
from typing import List

# imports
import pytest  # used for our unit tests
import torch
from inference.v1 import Detections
from inference.v1.models.rfdetr.post_processor import PostProcess
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch
from inference.v1.models.yolov8.common import PreProcessingMetadata


# Mock classes and functions for testing
class MockPostProcess:
    def __call__(self, model_results, target_sizes):
        # Mock post-processing logic: returns model results as is
        return [{"scores": model_results[:, 0], "labels": model_results[:, 1], "boxes": model_results[:, 2:]}]

class MockPreProcessingMetadata:
    def __init__(self, height, width):
        self.original_size = type('obj', (object,), {'height': height, 'width': width})

class MockDetections:
    def __init__(self, xyxy, confidence, class_ids):
        self.xyxy = xyxy
        self.confidence = confidence
        self.class_ids = class_ids
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch


# unit tests
@pytest.fixture
def setup():
    post_processor = MockPostProcess()
    device = torch.device('cpu')
    return RFDetrForObjectDetectionTorch(post_processor, device)

def test_standard_input(setup):
    model_results = torch.tensor([[0.6, 1, 10, 20, 30, 40], [0.4, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_empty_model_results(setup):
    model_results = torch.empty((0, 6), dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_all_detections_below_threshold(setup):
    model_results = torch.tensor([[0.4, 1, 10, 20, 30, 40], [0.3, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_all_detections_above_threshold(setup):
    model_results = torch.tensor([[0.6, 1, 10, 20, 30, 40], [0.7, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_large_batch_size(setup):
    model_results = torch.rand((1000, 6), dtype=torch.float32)
    model_results[:, 0] = torch.rand(1000)  # Random scores
    pre_processing_meta = [MockPreProcessingMetadata(100, 100) for _ in range(1000)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_boundary_score_values(setup):
    model_results = torch.tensor([[0.5, 1, 10, 20, 30, 40], [0.5, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_device_compatibility(setup):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    setup._device = device
    model_results = torch.tensor([[0.6, 1, 10, 20, 30, 40]], dtype=torch.float32).to(device)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from typing import List

# imports
import pytest
import torch
from inference.v1 import Detections
from inference.v1.models.rfdetr.post_processor import PostProcess
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch
from inference.v1.models.yolov8.common import PreProcessingMetadata


# mock classes
class MockDetections:
    def __init__(self, xyxy, confidence, class_ids):
        self.xyxy = xyxy
        self.confidence = confidence
        self.class_ids = class_ids

class MockPostProcess:
    def __call__(self, model_results, target_sizes):
        # Mocking post processing results
        return [
            {
                "scores": torch.tensor([0.6, 0.4, 0.8]),
                "labels": torch.tensor([1, 2, 3]),
                "boxes": torch.tensor([[10, 10, 50, 50], [20, 20, 60, 60], [30, 30, 70, 70]])
            }
        ]

class MockPreProcessingMetadata:
    def __init__(self, height, width):
        self.original_size = MockOriginalSize(height, width)

class MockOriginalSize:
    def __init__(self, height, width):
        self.height = height
        self.width = width
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch


# unit tests
@pytest.fixture
def setup_model():
    # Setup mock model and post processor
    model = None  # Placeholder for actual model
    pre_processing_config = None  # Placeholder for actual config
    class_names = ["class1", "class2", "class3"]
    device = torch.device("cpu")
    post_processor = MockPostProcess()
    return RFDetrForObjectDetectionTorch(model, pre_processing_config, class_names, device, post_processor)

def test_single_detection(setup_model):
    # Test with single detection above threshold
    model_results = torch.tensor([[0.6]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_multiple_detections(setup_model):
    # Test with multiple detections, some above and some below threshold
    model_results = torch.tensor([[0.6, 0.4, 0.8]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_no_detections(setup_model):
    # Test with all scores below threshold
    model_results = torch.tensor([[0.4]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_boundary_threshold(setup_model):
    # Test with scores exactly at the threshold
    model_results = torch.tensor([[0.5]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_extreme_values(setup_model):
    # Test with very high and very low confidence scores
    model_results = torch.tensor([[0.99, 0.01]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_large_scale(setup_model):
    # Test with large number of detections
    scores = torch.rand(1000)
    model_results = torch.tensor([scores])
    pre_processing_meta = [MockPreProcessingMetadata(1000, 1000)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta, threshold=0.5); detections = codeflash_output


def test_different_device(setup_model):
    # Test running on a different device (simulated by changing the device attribute)
    setup_model._device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_results = torch.tensor([[0.6]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, `git checkout codeflash/optimize-pr1250-2025-05-13T16.40.42` and push.

codeflash-ai bot added the ⚡️ codeflash label (Optimization PR opened by Codeflash AI) on May 13, 2025
@grzegorz-roboflow (Collaborator) commented:

Not relevant anymore, source branch received further updates.

codeflash-ai bot deleted the `codeflash/optimize-pr1250-2025-05-13T16.40.42` branch on June 10, 2025, 18:02
