
Conversation

codeflash-ai bot (Contributor) commented on May 13, 2025:

⚡️ This pull request contains optimizations for PR #1250

If you approve this dependent PR, these changes will be merged into the original PR branch feature/inference-v1-models.

This PR will be automatically closed if the original PR is merged.


📄 11% (0.11x) speedup for RFDetrForObjectDetectionTorch.post_process in inference/v1/models/rfdetr/rfdetr_object_detection_pytorch.py

⏱️ Runtime: 524 microseconds → 471 microseconds (best of 187 runs)

📝 Explanation and details

Below is the optimized version of your code.
Key performance problems, based on your profile output:

  • Bottleneck: masking (scores = scores[keep], labels = labels[keep], boxes = boxes[keep]). These are the most expensive operations, since they are repeated for each result; boxes = boxes[keep] alone is nearly as costly as filtering the scores. A sketch of this pattern follows the list.
  • Tensor creation (torch.tensor(orig_sizes, ...)) is a minor but measurable overhead.
  • Small loops (for result in results): for a reasonably sized batch, the per-iteration cost is dominated by the expensive tensor slicing.
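
For reference, the per-result pattern flagged by the profile presumably looks roughly like the following. This is reconstructed from the description above (variable and key names are assumptions), not copied from the PR:

```python
def filter_results_original(results, threshold=0.5):
    """Presumed original pattern: three separate boolean-masked assignments per result."""
    out = []
    for result in results:
        scores = result["scores"]
        labels = result["labels"]
        boxes = result["boxes"]
        keep = scores > threshold   # boolean mask
        scores = scores[keep]       # each masked indexing allocates a fresh tensor
        labels = labels[keep]
        boxes = boxes[keep]         # nearly as expensive as filtering the scores
        out.append((boxes, scores, labels))
    return out
```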

Optimization Strategy

  1. Batch vectorization: If results is a batch of dicts where every element has the same batch size and shapes, batch the mask and filtering in a vectorized way across all batch elements instead of looping. If that isn't possible due to shape irregularities, some gains can still be had by combining extraction and masking, or by using more in-place operations.
  2. Avoid excessive intermediate allocations: Where possible, copy less and avoid indexing the same tensor with [keep] multiple times.
  3. Avoid a Python list for orig_sizes: Construct the numpy array or tensor directly from a list comprehension or generator.

If true batch-vectorization isn’t possible due to differently sized outputs per result (common in detection models), we can still reduce cost by filtering all three arrays with the mask at once, and using tuple unpacking.
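
A minimal illustration of that fallback, with scores, labels and boxes as in the per-result dict above:

```python
# Fallback form: one boolean mask applied to all three tensors in a single
# statement via tuple unpacking, rather than three separate masked assignments.
keep = scores > threshold
scores, labels, boxes = scores[keep], labels[keep], boxes[keep]
```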

Below is the optimized code.
All previous comments preserved and code logic maintained.
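
The actual optimized diff is not reproduced in this comment; as an illustration only, the described loop might look roughly like the sketch below. Names are assumptions, and tuples stand in for the real Detections objects:

```python
import torch

def filter_results(results, threshold=0.5):
    """Hedged reconstruction of the optimized inner loop described in this PR."""
    detections_list = []
    append = detections_list.append  # bind the method locally for the tight loop
    for result in results:
        scores, labels, boxes = result["scores"], result["labels"], result["boxes"]
        # Integer indices via nonzero() instead of repeated boolean masking.
        keep = (scores > threshold).nonzero(as_tuple=False).squeeze(-1)
        if keep.numel() == 0:
            # Nothing kept: append a single empty entry and skip further allocations.
            # (The real code would wrap these in a Detections object.)
            append((boxes[:0], scores[:0], labels[:0]))
            continue
        # index_select reuses the same index tensor for all three fields.
        append((
            boxes.index_select(0, keep),
            scores.index_select(0, keep),
            labels.index_select(0, keep),
        ))
    return detections_list
```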

Key changes explained:

  • keep = (scores > threshold).nonzero(as_tuple=False).squeeze(-1):
    The .nonzero().squeeze() pattern is much faster than repeated boolean masking.
    If nothing is kept, the loop appends a single empty entry and skips further object allocation.

  • Use index_select, which is typically faster for 1D index selection, especially when the same index tensor is reused across several fields.

  • Binds the detections_list.append method to a local name (a small optimization, but it helps in tight Python loops).

  • Avoids repeated variable annotation in the filtered lines.

  • Returns exactly the same result as before.

If you know the batch always contains at least some detections, you could remove the empty check, but keeping it makes the code robust at negligible cost.

If results is huge and the loop is still too slow, it would be worth profiling the underlying model and postprocessor.
But this approach reduces your inner-loop cost for the scores > threshold filtering by ~2x-3x per iteration.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 12 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
📊 Tests Coverage
🌀 Generated Regression Tests Details

```python
from typing import List

# imports
import pytest  # used for our unit tests
import torch
from inference.v1 import Detections
from inference.v1.models.rfdetr.post_processor import PostProcess
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch
from inference.v1.models.yolov8.common import PreProcessingMetadata


# Mock classes and functions for testing
class MockPostProcess:
    def __call__(self, model_results, target_sizes):
        # Mock post-processing logic: returns model results as is
        return [{"scores": model_results[:, 0], "labels": model_results[:, 1], "boxes": model_results[:, 2:]}]

class MockPreProcessingMetadata:
    def __init__(self, height, width):
        self.original_size = type('obj', (object,), {'height': height, 'width': width})

class MockDetections:
    def __init__(self, xyxy, confidence, class_ids):
        self.xyxy = xyxy
        self.confidence = confidence
        self.class_ids = class_ids
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch


# unit tests
@pytest.fixture
def setup():
    post_processor = MockPostProcess()
    device = torch.device('cpu')
    return RFDetrForObjectDetectionTorch(post_processor, device)

def test_standard_input(setup):
    model_results = torch.tensor([[0.6, 1, 10, 20, 30, 40], [0.4, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_empty_model_results(setup):
    model_results = torch.empty((0, 6), dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_all_detections_below_threshold(setup):
    model_results = torch.tensor([[0.4, 1, 10, 20, 30, 40], [0.3, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_all_detections_above_threshold(setup):
    model_results = torch.tensor([[0.6, 1, 10, 20, 30, 40], [0.7, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_large_batch_size(setup):
    model_results = torch.rand((1000, 6), dtype=torch.float32)
    model_results[:, 0] = torch.rand(1000)  # Random scores
    pre_processing_meta = [MockPreProcessingMetadata(100, 100) for _ in range(1000)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_boundary_score_values(setup):
    model_results = torch.tensor([[0.5, 1, 10, 20, 30, 40], [0.5, 2, 15, 25, 35, 45]], dtype=torch.float32)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_device_compatibility(setup):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    setup._device = device
    model_results = torch.tensor([[0.6, 1, 10, 20, 30, 40]], dtype=torch.float32).to(device)
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup.post_process(model_results, pre_processing_meta); detections = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from typing import List

# imports
import pytest
import torch
from inference.v1 import Detections
from inference.v1.models.rfdetr.post_processor import PostProcess
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch
from inference.v1.models.yolov8.common import PreProcessingMetadata


# mock classes
class MockDetections:
    def __init__(self, xyxy, confidence, class_ids):
        self.xyxy = xyxy
        self.confidence = confidence
        self.class_ids = class_ids

class MockPostProcess:
    def __call__(self, model_results, target_sizes):
        # Mocking post processing results
        return [
            {
                "scores": torch.tensor([0.6, 0.4, 0.8]),
                "labels": torch.tensor([1, 2, 3]),
                "boxes": torch.tensor([[10, 10, 50, 50], [20, 20, 60, 60], [30, 30, 70, 70]])
            }
        ]

class MockPreProcessingMetadata:
    def __init__(self, height, width):
        self.original_size = MockOriginalSize(height, width)

class MockOriginalSize:
    def __init__(self, height, width):
        self.height = height
        self.width = width
from inference.v1.models.rfdetr.rfdetr_object_detection_pytorch import \
    RFDetrForObjectDetectionTorch


# unit tests
@pytest.fixture
def setup_model():
    # Setup mock model and post processor
    model = None  # Placeholder for actual model
    pre_processing_config = None  # Placeholder for actual config
    class_names = ["class1", "class2", "class3"]
    device = torch.device("cpu")
    post_processor = MockPostProcess()
    return RFDetrForObjectDetectionTorch(model, pre_processing_config, class_names, device, post_processor)

def test_single_detection(setup_model):
    # Test with single detection above threshold
    model_results = torch.tensor([[0.6]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_multiple_detections(setup_model):
    # Test with multiple detections, some above and some below threshold
    model_results = torch.tensor([[0.6, 0.4, 0.8]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_no_detections(setup_model):
    # Test with all scores below threshold
    model_results = torch.tensor([[0.4]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_boundary_threshold(setup_model):
    # Test with scores exactly at the threshold
    model_results = torch.tensor([[0.5]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_extreme_values(setup_model):
    # Test with very high and very low confidence scores
    model_results = torch.tensor([[0.99, 0.01]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output

def test_large_scale(setup_model):
    # Test with large number of detections
    scores = torch.rand(1000)
    model_results = torch.tensor([scores])
    pre_processing_meta = [MockPreProcessingMetadata(1000, 1000)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta, threshold=0.5); detections = codeflash_output


def test_different_device(setup_model):
    # Test running on a different device (simulated by changing the device attribute)
    setup_model._device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_results = torch.tensor([[0.6]])
    pre_processing_meta = [MockPreProcessingMetadata(100, 100)]
    codeflash_output = setup_model.post_process(model_results, pre_processing_meta); detections = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, `git checkout codeflash/optimize-pr1250-2025-05-13T16.40.42` and push.

codeflash-ai bot added the ⚡️ codeflash label (Optimization PR opened by Codeflash AI) on May 13, 2025
@grzegorz-roboflow (Collaborator) commented:

Not relevant anymore, source branch received further updates.

codeflash-ai bot deleted the `codeflash/optimize-pr1250-2025-05-13T16.40.42` branch on June 10, 2025, 18:02
