
[codex] Add memory-safe OWLv2 preprocessing #2304

Draft

hansent wants to merge 2 commits into main from codex/memory-safe-owlv2-preprocessing

Conversation


hansent commented Apr 30, 2026

Summary

This PR adds a memory-safe OWLv2 image preprocessing path for the new inference_models OWLv2/Roboflow Instant implementation.

Instead of handing target images directly to the Hugging Face OWLv2 image processor, we now resize the image to the model input envelope first, then pad to the configured OWLv2 input size and normalize using the processor config. This preserves the bounded model input size while avoiding the very large square intermediate that HF can allocate for extreme-aspect-ratio images.
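A minimal sketch of the resize-then-pad order of operations is below. The function name, interpolation choice, and normalization constants are illustrative only; the actual code reads the target size and mean/std values from the HF processor config.

```python
import cv2
import numpy as np

def resize_then_pad(
    image: np.ndarray,                    # HWC uint8, RGB
    target_size: int = 1008,              # OWLv2 input side (from processor config)
    mean: tuple = (0.485, 0.456, 0.406),  # placeholder; real values come from config
    std: tuple = (0.229, 0.224, 0.225),   # placeholder; real values come from config
) -> np.ndarray:
    h, w = image.shape[:2]
    scale = target_size / max(h, w)
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    # Resize first: the largest buffer we ever allocate is target_size x target_size.
    resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    # Pad at the fixed target resolution, never at the original resolution.
    canvas = np.zeros((target_size, target_size, 3), dtype=np.float32)
    canvas[:new_h, :new_w] = resized.astype(np.float32) / 255.0

    # Normalize with the processor's mean/std, then go to CHW for the model.
    canvas = (canvas - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    return canvas.transpose(2, 0, 1)
```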

It also updates OWLv2 image hashing to hash a byte view for contiguous arrays instead of calling .tobytes(), avoiding an extra full-resolution copy before preprocessing.
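Roughly, the hashing change looks like the sketch below; the helper name and hash algorithm are illustrative, not the exact code in this PR.

```python
import hashlib
import numpy as np

def image_hash(image: np.ndarray) -> str:
    """Hash image bytes without forcing a full-resolution copy when possible."""
    if image.flags["C_CONTIGUOUS"]:
        # Zero-copy byte view over the existing buffer.
        buf = memoryview(image).cast("B")
    else:
        # Non-contiguous views still need a materialized copy.
        buf = image.tobytes()
    return hashlib.sha256(buf).hexdigest()
```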

Why

We have seen hosted serverless inference containers get OOM-killed when Roboflow Instant receives very high-resolution or extreme-aspect-ratio images. One concrete example was an 800x15078 image.

The issue is not that final OWLv2 embeddings become enormous: OWLv2 ultimately runs on a fixed-size input. The problem is the preprocessing path. The HF OWLv2 processor pads to a square based on the longest side before resizing. For 800x15078, that can transiently create a 15078x15078 image-like intermediate before the image is resized down to the model input size. That intermediate is hundreds of MiB as uint8 and can become multiple GiB once float/intermediate processor buffers are involved. In serverless, that can kill the container before Python has a chance to return a controlled error.
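For concreteness, the back-of-envelope footprint of that square intermediate:

```python
side = 15078
uint8_mib = side * side * 3 / 2**20        # ~650 MiB for one uint8 HWC square
float32_gib = side * side * 3 * 4 / 2**30  # ~2.5 GiB for a single float32 copy of it
```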

By resizing before padding, the same 800x15078 image is reduced into the OWLv2 input envelope first, then padded at the fixed target resolution. For the default 1008x1008 OWLv2 size, the content becomes roughly 53x1008 before fixed-size padding, rather than allocating a 15078x15078 square.
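The corresponding scale math for the example image:

```python
target = 1008
h, w = 15078, 800          # the problem image, tall orientation
scale = target / max(h, w) # ~0.0669
new_h, new_w = round(h * scale), round(w * scale)  # (1008, 53)
```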

Notes

  • This intentionally does not change Roboflow Instant target-image embedding cache behavior.
  • This still assumes the request image has already been decoded by the adapter. A separate pre-decode byte/pixel guard would be the next hard protection for truly oversized request payloads (a rough sketch of that idea follows this list).
  • Text processing and OWLv2 post-processing continue to use the HF processor.
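
For illustration only, a pre-decode guard could read header metadata without decoding pixel data (PIL opens images lazily). This is a sketch of the idea, including a made-up threshold, not code from this PR:

```python
from io import BytesIO
from PIL import Image

MAX_PIXELS = 50_000_000  # illustrative threshold, not a value from this PR

def guard_payload(payload: bytes) -> None:
    """Reject oversized images from header metadata before full decode."""
    with Image.open(BytesIO(payload)) as im:  # lazy: reads header only
        w, h = im.size
    if w * h > MAX_PIXELS:
        raise ValueError(f"refusing {w}x{h} image ({w * h} pixels)")
```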

Validation

  • python -B -m py_compile inference_models/inference_models/models/owlv2/owlv2_hf.py inference_models/inference_models/models/owlv2/reference_dataset.py inference_models/tests/unit_tests/models/owlv2/__init__.py inference_models/tests/unit_tests/models/owlv2/test_memory_safe_preprocessing.py
  • git diff --cached --check

Could not run pytest in this shell: neither pytest nor the model dependency stack (numpy, torch, cv2, transformers) is installed here.
