
Conversation

Contributor

@codeflash-ai codeflash-ai bot commented Jun 5, 2025

⚡️ This pull request contains optimizations for PR #1250

If you approve this dependent PR, these changes will be merged into the original PR branch `feature/inference-v1-models`.

This PR will be automatically closed if the original PR is merged.


📄 29% (1.29×) speedup for `rescale_detections` in `inference/v1/models/common/post_processing.py`

⏱️ Runtime: 24.8 milliseconds → 19.2 milliseconds (best of 95 runs)

📝 Explanation and details

Here’s an optimized version of your program.
Line profiling shows that regenerating the 1D tensors (`offsets`, `scale`) and the sliced in-place ops are the major time consumers, and that running every per-image operation inside a tight Python loop adds further overhead.

### Key ideas

- Avoid per-row in-place ops. Instead, use a fused operation on the entire tensor stack, if possible.
- Precompute offsets/scale as arrays or tensors in advance, in batch.
- Replace the Python `for` loop with vectorization where possible.
- Fuse the subtraction/division.
- Avoid repeated creation of small tensors.
- Keep compatibility with all input shapes and types.

If your images always have the same metadata, full vectorization is possible.
If metadata varies per image (as profiling suggests), batch vectorization of the first four columns within a single kernel offers the main speed gain.
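
For the uniform-metadata case, here is a hypothetical illustration (not part of this PR) of collapsing the whole batch into a single kernel:

```python
# Hypothetical helper: when every image shares one metadata object, concatenate
# all detections, rescale them with one fused in-place op, then split back.
import torch


def rescale_with_shared_metadata(detections, meta):
    lengths = [d.shape[0] for d in detections]
    stacked = torch.cat(detections, dim=0)  # (sum of all num_dets, 5)
    offsets = stacked.new_tensor([meta.pad_left, meta.pad_top, meta.pad_left, meta.pad_top])
    scale = stacked.new_tensor([meta.scale_width, meta.scale_height, meta.scale_width, meta.scale_height])
    stacked[:, :4].sub_(offsets).div_(scale)  # single fused op over all boxes
    return list(torch.split(stacked, lengths, dim=0))
```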

Below is a much faster version.
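
The optimized code itself was not rendered in this page capture, so below is a minimal sketch reconstructed from the description above; the metadata field names follow the test stubs further down, and the exact signature of the real PR is an assumption.

```python
# Reconstructed sketch (not the PR's verbatim diff): per image, build the
# offset/scale tensors once, then rescale the four box columns with a single
# fused in-place subtract/divide, with no per-row tensor allocations.
from typing import List, Sequence

import torch


def rescale_detections(
    detections: List[torch.Tensor],  # each (num_dets, 5): x1, y1, x2, y2, score
    images_metadata: Sequence,       # objects with pad_left/pad_top/scale_width/scale_height
) -> List[torch.Tensor]:
    for dets, meta in zip(detections, images_metadata):
        if dets.numel() == 0:
            continue  # handle empty detections robustly
        offsets = torch.tensor(
            [meta.pad_left, meta.pad_top, meta.pad_left, meta.pad_top],
            dtype=dets.dtype, device=dets.device,
        )
        scale = torch.tensor(
            [meta.scale_width, meta.scale_height, meta.scale_width, meta.scale_height],
            dtype=dets.dtype, device=dets.device,
        )
        dets[:, :4].sub_(offsets).div_(scale)  # in-place: (xyxy - pad) / scale
    return detections
```

Note that `zip` truncates to the shorter of the two lists, which matches the mismatched-lengths behavior exercised by the generated tests below.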


### What Changed and Why

- **No more per-call `torch.as_tensor` inside the tightest loop:** `offsets` and `scales` for all images are created once per batch.
- **In-place operations still leveraged** for memory efficiency; only slicing, subtraction, and division happen in the innermost loop (no new tensor allocations for every detection).
- **Single-image fallback** (`rescale_image_detections`) still works, but is consolidated with pre-built tensors (`torch.tensor`, not `as_tensor`) for better performance; this skips small hidden overheads.
- **Handles empty detections robustly.**
- **Function interface and comments preserved.**

### Further potential batch speedups

If you can pad all detection tensors to the same shape, you can batch-process the entire `detections` list using broadcasting for a further speedup, as sketched below.
For now, this version assumes detection tensors may have different lengths, which matches your usage pattern.
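
A hedged sketch of that padded-batch idea (hypothetical helper, assuming all tensors share dtype and device):

```python
# Hypothetical padded-batch variant: pad per-image tensors to a common length,
# rescale every box with one broadcasted op, then slice the padding back off.
import torch
from torch.nn.utils.rnn import pad_sequence


def rescale_padded(detections, metadata):
    lengths = [d.shape[0] for d in detections]
    padded = pad_sequence(detections, batch_first=True)  # (batch, max_dets, 5)
    offsets = padded.new_tensor(
        [[m.pad_left, m.pad_top, m.pad_left, m.pad_top] for m in metadata]
    ).unsqueeze(1)  # (batch, 1, 4) broadcasts over the detection dimension
    scales = padded.new_tensor(
        [[m.scale_width, m.scale_height, m.scale_width, m.scale_height] for m in metadata]
    ).unsqueeze(1)
    padded[..., :4] = (padded[..., :4] - offsets) / scales
    return [padded[i, :n] for i, n in enumerate(lengths)]
```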

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 34 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests Details

```python
from typing import List

# imports
import pytest  # used for our unit tests
import torch
from inference.v1.models.common.post_processing import rescale_detections


# Dummy PreProcessingMetadata for testing
class PreProcessingMetadata:
    def __init__(self, pad_left, pad_top, scale_width, scale_height):
        self.pad_left = pad_left
        self.pad_top = pad_top
        self.scale_width = scale_width
        self.scale_height = scale_height

# ------------------------------
# Unit tests for rescale_detections
# ------------------------------

# BASIC TEST CASES

def test_single_detection_no_padding_no_scaling():
    # One detection, no padding, no scaling
    det = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9]])
    meta = PreProcessingMetadata(pad_left=0, pad_top=0, scale_width=1, scale_height=1)
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_single_detection_with_padding():
    # One detection, with padding
    det = torch.tensor([[12.0, 22.0, 32.0, 42.0, 0.8]])
    meta = PreProcessingMetadata(pad_left=2, pad_top=2, scale_width=1, scale_height=1)
    expected = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.8]])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_single_detection_with_scaling():
    # One detection, with scaling
    det = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.7]])
    meta = PreProcessingMetadata(pad_left=0, pad_top=0, scale_width=2, scale_height=4)
    expected = torch.tensor([[5.0, 5.0, 15.0, 10.0, 0.7]])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_single_detection_with_padding_and_scaling():
    # One detection, with both padding and scaling
    det = torch.tensor([[14.0, 24.0, 34.0, 44.0, 0.6]])
    meta = PreProcessingMetadata(pad_left=4, pad_top=4, scale_width=2, scale_height=4)
    expected = torch.tensor([[5.0, 5.0, 15.0, 10.0, 0.6]])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_multiple_detections_single_image():
    # Multiple detections in a single image
    det = torch.tensor([
        [12.0, 22.0, 32.0, 42.0, 0.8],
        [22.0, 32.0, 52.0, 62.0, 0.7]
    ])
    meta = PreProcessingMetadata(pad_left=2, pad_top=2, scale_width=2, scale_height=2)
    expected = torch.tensor([
        [5.0, 10.0, 15.0, 20.0, 0.8],
        [10.0, 15.0, 25.0, 30.0, 0.7]
    ])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_multiple_images_varied_metadata():
    # Multiple images, each with different metadata
    det1 = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9]])
    meta1 = PreProcessingMetadata(0, 0, 2, 2)
    det2 = torch.tensor([[15.0, 25.0, 35.0, 45.0, 0.8]])
    meta2 = PreProcessingMetadata(5, 5, 1, 1)
    expected1 = torch.tensor([[5.0, 10.0, 15.0, 20.0, 0.9]])
    expected2 = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.8]])
    codeflash_output = rescale_detections([det1.clone(), det2.clone()], [meta1, meta2]); result = codeflash_output

# EDGE TEST CASES

def test_empty_detections_list():
    # No images at all
    codeflash_output = rescale_detections([], []); result = codeflash_output

def test_empty_detections_for_image():
    # Image with no detections
    det = torch.empty((0, 5))
    meta = PreProcessingMetadata(1, 1, 1, 1)
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_large_padding_and_scaling():
    # Large padding and scaling values
    det = torch.tensor([[1000.0, 2000.0, 3000.0, 4000.0, 0.5]])
    meta = PreProcessingMetadata(1000, 2000, 10, 20)
    expected = torch.tensor([[0.0, 0.0, 200.0, 100.0, 0.5]])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_negative_padding_and_scaling():
    # Negative padding (should add to coordinates) and negative scaling (should flip sign)
    det = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.3]])
    meta = PreProcessingMetadata(-10, -20, -2, -4)
    expected = torch.tensor([[(10+10)/-2, (20+20)/-4, (30+10)/-2, (40+20)/-4, 0.3]])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output


def test_mismatched_input_lengths():
    # Number of detections and metadata do not match
    dets = [torch.tensor([[1.,2.,3.,4.,0.1]])]
    metas = [
        PreProcessingMetadata(0,0,1,1),
        PreProcessingMetadata(0,0,1,1)
    ]
    # Should process up to min(len(detections), len(metadata)), so no error, but only one processed
    codeflash_output = rescale_detections(dets.copy(), metas.copy()); result = codeflash_output

def test_inplace_modification():
    # Ensure function modifies input in-place
    det = torch.tensor([[12.0, 22.0, 32.0, 42.0, 0.8]])
    meta = PreProcessingMetadata(2, 2, 2, 2)
    det_clone = det.clone()
    rescale_detections([det], [meta])
    expected = torch.tensor([[5.0, 10.0, 15.0, 20.0, 0.8]])

def test_non_float_tensor():
    # Function should work with integer tensors (will cast to float for division)
    det = torch.tensor([[12, 22, 32, 42, 1]])
    meta = PreProcessingMetadata(2, 2, 2, 2)
    expected = torch.tensor([[5.0, 10.0, 15.0, 20.0, 1.0]])
    codeflash_output = rescale_detections([det.clone().float()], [meta]); result = codeflash_output

def test_high_precision():
    # Check that float64 is preserved
    det = torch.tensor([[12.0, 22.0, 32.0, 42.0, 0.8]], dtype=torch.float64)
    meta = PreProcessingMetadata(2, 2, 2, 2)
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

# LARGE SCALE TEST CASES

def test_large_number_of_detections():
    # Test with a large number of detections (e.g., 1000)
    num_dets = 1000
    det = torch.stack([
        torch.tensor([float(i), float(i+1), float(i+2), float(i+3), 0.5])
        for i in range(num_dets)
    ])
    meta = PreProcessingMetadata(1, 2, 2, 4)
    expected = torch.stack([
        torch.tensor([(i-1)/2, (i+1-2)/4, (i+2-1)/2, (i+3-2)/4, 0.5])
        for i in range(num_dets)
    ])
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_large_batch_of_images():
    # Test with a large batch of images (e.g., 500), each with one detection
    batch_size = 500
    dets = [torch.tensor([[float(i), float(i+1), float(i+2), float(i+3), 0.9]]) for i in range(batch_size)]
    metas = [PreProcessingMetadata(i, i+1, 2, 4) for i in range(batch_size)]
    expected = [
        torch.tensor([[(i-i)/2, (i+1-(i+1))/4, (i+2-i)/2, (i+3-(i+1))/4, 0.9]])
        for i in range(batch_size)
    ]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); result = codeflash_output
    for i in range(batch_size):
        pass

def test_memory_efficiency_large_tensor():
    # Test with a single image and a large detection tensor, but under 100MB
    n = 1000  # 1000 detections, 5 values each, float32: 1000*5*4 = 20KB
    det = torch.arange(n*5, dtype=torch.float32).reshape(n,5)
    meta = PreProcessingMetadata(10, 20, 2, 4)
    expected = det.clone()
    expected[:, 0] = (expected[:, 0] - 10) / 2
    expected[:, 1] = (expected[:, 1] - 20) / 4
    expected[:, 2] = (expected[:, 2] - 10) / 2
    expected[:, 3] = (expected[:, 3] - 20) / 4
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output

def test_large_scale_extreme_values():
    # Test with large values and large number of detections
    n = 1000
    det = torch.full((n, 5), 1e6, dtype=torch.float32)
    meta = PreProcessingMetadata(1e5, 2e5, 10, 20)
    expected = torch.full((n, 5), 1e6, dtype=torch.float32)
    expected[:, 0] = (expected[:, 0] - 1e5) / 10
    expected[:, 1] = (expected[:, 1] - 2e5) / 20
    expected[:, 2] = (expected[:, 2] - 1e5) / 10
    expected[:, 3] = (expected[:, 3] - 2e5) / 20
    codeflash_output = rescale_detections([det.clone()], [meta]); result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

```python
from typing import List

# imports
import pytest  # used for our unit tests
import torch
from inference.v1.models.common.post_processing import rescale_detections


# Minimal PreProcessingMetadata for testing
class PreProcessingMetadata:
    def __init__(self, pad_left, pad_top, scale_width, scale_height):
        self.pad_left = pad_left
        self.pad_top = pad_top
        self.scale_width = scale_width
        self.scale_height = scale_height

# --- Unit Tests ---

# 1. Basic Test Cases

def test_single_detection_no_padding_no_scaling():
    # Single detection, no padding, no scaling
    dets = [torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9]])]
    metas = [PreProcessingMetadata(0, 0, 1.0, 1.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output

def test_single_detection_with_padding():
    # Single detection, with padding
    dets = [torch.tensor([[12.0, 25.0, 32.0, 45.0, 0.8]])]
    metas = [PreProcessingMetadata(2, 5, 1.0, 1.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    expected = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.8]])

def test_single_detection_with_scaling():
    # Single detection, with scaling
    dets = [torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.7]])]
    metas = [PreProcessingMetadata(0, 0, 2.0, 4.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    expected = torch.tensor([[5.0, 5.0, 15.0, 10.0, 0.7]])

def test_single_detection_with_padding_and_scaling():
    # Single detection, with both padding and scaling
    dets = [torch.tensor([[12.0, 24.0, 32.0, 44.0, 0.6]])]
    metas = [PreProcessingMetadata(2, 4, 2.0, 4.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    expected = torch.tensor([[5.0, 5.0, 15.0, 10.0, 0.6]])

def test_multiple_detections_multiple_images():
    # Two images, different paddings and scales, multiple detections
    dets = [
        torch.tensor([[12.0, 24.0, 32.0, 44.0, 0.6], [22.0, 34.0, 42.0, 54.0, 0.5]]),
        torch.tensor([[20.0, 40.0, 60.0, 80.0, 0.9]])
    ]
    metas = [
        PreProcessingMetadata(2, 4, 2.0, 4.0),
        PreProcessingMetadata(10, 20, 5.0, 10.0)
    ]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    expected0 = torch.tensor([[5.0, 5.0, 15.0, 10.0, 0.6], [10.0, 7.5, 20.0, 12.5, 0.5]])
    expected1 = torch.tensor([[2.0, 2.0, 10.0, 6.0, 0.9]])

# 2. Edge Test Cases

def test_empty_detections_list():
    # No detections, no images
    codeflash_output = rescale_detections([], []); out = codeflash_output

def test_empty_detections_per_image():
    # Image with no detections
    dets = [torch.empty((0, 5))]
    metas = [PreProcessingMetadata(0, 0, 1.0, 1.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output


def test_negative_padding_and_scaling():
    # Negative padding and scaling
    dets = [torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.7]])]
    metas = [PreProcessingMetadata(-2, -4, -2.0, -4.0)]
    # Should not raise, but output will be negative
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    expected = torch.tensor([[(10.0+2)/-2.0, (20.0+4)/-4.0, (30.0+2)/-2.0, (40.0+4)/-4.0, 0.7]])

def test_non_square_boxes():
    # Boxes with width != height
    dets = [torch.tensor([[10.0, 20.0, 30.0, 80.0, 0.7]])]
    metas = [PreProcessingMetadata(5, 10, 2.0, 10.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    expected = torch.tensor([[(10.0-5)/2.0, (20.0-10)/10.0, (30.0-5)/2.0, (80.0-10)/10.0, 0.7]])

def test_different_tensor_dtypes_and_devices():
    # float32, float64, cpu, (cuda if available)
    for dtype in [torch.float32, torch.float64]:
        dets = [torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.7]], dtype=dtype)]
        metas = [PreProcessingMetadata(1, 2, 3.0, 4.0)]
        codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
        expected = torch.tensor([[(10.0-1)/3.0, (20.0-2)/4.0, (30.0-1)/3.0, (40.0-2)/4.0, 0.7]], dtype=dtype)
    if torch.cuda.is_available():
        dets = [torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.7]], device='cuda')]
        metas = [PreProcessingMetadata(1, 2, 3.0, 4.0)]
        codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
        expected = torch.tensor([[(10.0-1)/3.0, (20.0-2)/4.0, (30.0-1)/3.0, (40.0-2)/4.0, 0.7]], device='cuda')

def test_inplace_modification():
    # Check that the function modifies the input tensors in-place
    det = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.7]])
    det_clone = det.clone()
    metas = [PreProcessingMetadata(1, 2, 3.0, 4.0)]
    rescale_detections([det], metas)
    expected = torch.tensor([[(10.0-1)/3.0, (20.0-2)/4.0, (30.0-1)/3.0, (40.0-2)/4.0, 0.7]])

# 3. Large Scale Test Cases

def test_many_detections_per_image():
    # 1000 detections for one image
    n = 1000
    dets = [torch.cat([torch.arange(1, n+1).unsqueeze(1).float() for _ in range(5)], dim=1)]  # shape (1000,5)
    metas = [PreProcessingMetadata(1, 2, 3.0, 4.0)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); out = codeflash_output
    # Check first and last row
    expected_first = torch.tensor([(1-1)/3.0, (1-2)/4.0, (1-1)/3.0, (1-2)/4.0, 1.0])
    expected_last = torch.tensor([(1000-1)/3.0, (1000-2)/4.0, (1000-1)/3.0, (1000-2)/4.0, 1000.0])

def test_many_images_small_detections():
    # 100 images, each with 2 detections
    n = 100
    dets = [torch.tensor([[i+1.0, i+2.0, i+3.0, i+4.0, i+0.5], [i+5.0, i+6.0, i+7.0, i+8.0, i+0.7]]) for i in range(n)]
    metas = [PreProcessingMetadata(i%3, i%5, (i%4)+1.0, (i%6)+1.0) for i in range(n)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); outs = codeflash_output
    for i in range(n):
        # Check first detection of each image
        expected = torch.tensor([
            (i+1.0 - metas[i].pad_left)/metas[i].scale_width,
            (i+2.0 - metas[i].pad_top)/metas[i].scale_height,
            (i+3.0 - metas[i].pad_left)/metas[i].scale_width,
            (i+4.0 - metas[i].pad_top)/metas[i].scale_height,
            i+0.5
        ])

def test_large_tensor_memory_limit():
    # Tensor size below 100MB (float32: 4 bytes per value)
    # 1000 detections, 5 values each: 1000*5*4 = 20KB, repeat for 100 images: 2MB
    n_imgs = 100
    n_dets = 1000
    dets = [torch.ones((n_dets, 5)) * i for i in range(n_imgs)]
    metas = [PreProcessingMetadata(1, 2, 3.0, 4.0) for _ in range(n_imgs)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); outs = codeflash_output
    # Check a few random indices
    for i in [0, 50, 99]:
        expected = torch.tensor([(i-1)/3.0, (i-2)/4.0, (i-1)/3.0, (i-2)/4.0, i])

def test_large_scale_varied_metadata():
    # 500 images, each with 2 detections, varied metadata
    n = 500
    dets = [torch.tensor([[i+1.0, i+2.0, i+3.0, i+4.0, i+0.5], [i+5.0, i+6.0, i+7.0, i+8.0, i+0.7]]) for i in range(n)]
    metas = [PreProcessingMetadata(i%7, i%11, (i%9)+1.0, (i%8)+1.0) for i in range(n)]
    codeflash_output = rescale_detections([d.clone() for d in dets], metas); outs = codeflash_output
    # Check a few random indices
    for i in [0, 123, 499]:
        expected = torch.tensor([
            (i+1.0 - metas[i].pad_left)/metas[i].scale_width,
            (i+2.0 - metas[i].pad_top)/metas[i].scale_height,
            (i+3.0 - metas[i].pad_left)/metas[i].scale_width,
            (i+4.0 - metas[i].pad_top)/metas[i].scale_height,
            i+0.5
        ])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

To edit these changes, `git checkout codeflash/optimize-pr1250-2025-06-05T15.39.56` and push.

Codeflash

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash label (Optimization PR opened by Codeflash AI) on Jun 5, 2025
@codeflash-ai codeflash-ai bot mentioned this pull request Jun 5, 2025
@grzegorz-roboflow
Collaborator

Not relevant anymore, source branch received further updates.

@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr1250-2025-06-05T15.39.56 branch June 12, 2025 08:51