
perf: Triton pre-processing kernel#2328

Open
aseembits93 wants to merge 10 commits into
roboflow:mainfrom
aseembits93:perf/rfdetr-seg-triton-widen-scope

Conversation

aseembits93 (Contributor) commented May 12, 2026

What does this PR do?

This PR introduces a single fused Triton CUDA kernel that executes the entire image pre-processing pipeline in one launch, eliminating per-op kernel launch overhead and reducing CPU<->GPU memory transfers to the bare minimum. The kernel supports all pre-processing options EXCEPT dataset_version_resize (a two-stage resize: cv2 dataset-version resize → PIL stretch) and contrast (three distinct algorithms: histogram equalization, CLAHE, and contrast stretching, each of which would need its own pre-pass kernel).
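As a rough illustration of the math the fused kernel performs after the resize stage, here is a NumPy sketch (not the Triton code; the function name, the CHW output layout, and the omission of the resize step are my own assumptions):

```python
import numpy as np

# Standard ImageNet statistics used by the normalize step.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def fused_preprocess_reference(bgr_u8: np.ndarray) -> np.ndarray:
    """Reference for the post-resize math the fused kernel folds together:
    BGR->RGB swap, scale by 1/255, ImageNet normalization, HWC->CHW."""
    rgb = bgr_u8[:, :, ::-1].astype(np.float32)          # BGR -> RGB
    rgb = rgb / 255.0                                    # scale to [0, 1]
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD           # per-channel normalize
    return np.ascontiguousarray(rgb.transpose(2, 0, 1))  # HWC -> CHW
```

Fusing these steps into one kernel is what removes the intermediate tensors that the per-op PIL/torch chain materializes between each stage.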

Type of Change

  • New feature (non-breaking change that adds functionality)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details (make sure Triton is installed in your environment):

  • Performance gains on TensorRT video input. Run twice with
USE_TRITON_FOR_PREPROCESSING="false" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4

USE_TRITON_FOR_PREPROCESSING="true" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4

vehicles_312px.mp4 (538 frames, src 312×176): T4 GPU

                             fps     ms/frame
PIL reference (env=false)    76.25   13.11
Triton fast path (env=true)  99.83   10.02
Δ                            +31%    −3.09 ms

vehicles_1080p.mp4 (538 frames, src 1920×1080 — preproc has real resize work): T4 GPU

                             fps     elapsed
PIL reference (env=false)    14.05   38.29 s
Triton fast path (env=true)  21.34   25.21 s
Δ                            +52%    −13.1 s
  • Correctness guarantees on COCO val2017. Make sure you have coco/ in the current working directory. Run python temp/detection_parity_full.py
                          Triton fast path (env=true)   PIL reference (env=false)
Triton kernel calls       5000 / 5000                   0
Detections                26,721                        26,721
Matched at IoU>0.5        26,721 (100%)
Mean box IoU              1.000000
Mean |Δscore|             0.000e+00
Class-id disagreements    0
Pixel-identical masks     26,721 / 26,721
  • Preproc microbench (isolated pre_process()) T4 GPU
USE_TRITON_FOR_PREPROCESSING="false"  python temp/preproc_microbench.py
USE_TRITON_FOR_PREPROCESSING="true" python temp/preproc_microbench.py
src → 312²    PIL (env=false)   Triton (env=true)
312×312       1.96 ms           0.29 ms (~7×)
720×1280      13.52 ms          1.63 ms (~8×)
1080×1920     28.03 ms          2.70 ms (~10×)
  • Side-by-Side visual comparison

https://drive.google.com/file/d/1aXWk0hgTsMsfUDwqF7wxqK9YrgMUPH09/view?usp=sharing

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

Replace the per-frame PIL-bilinear-antialias + to_tensor + normalize chain
in the RF-DETR TRT instance-segmentation model with a single Triton
kernel that resizes, swaps BGR↔RGB, scales by 1/255, and applies
ImageNet normalization — writing straight into the preallocated TRT
input buffer.

Byte-exact port of PIL's separable bilinear-antialias resize
(PRECISION_BITS=22, int32 fixed-point, uint8 quantization between the
horizontal and vertical passes). The horizontal uint8 intermediate
lives in registers.
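The coefficient construction being ported can be sketched in NumPy as follows. This is an approximation of PIL's precompute_coeffs for the triangle (bilinear) filter; the exact truncation and rounding details in Pillow's C source may differ slightly:

```python
import numpy as np

PRECISION_BITS = 32 - 8 - 2  # 22, matching PIL's resample fixed-point

def bilinear_coeffs(in_size: int, out_size: int):
    """Per output pixel, compute a source window [xmin, xmax) and int32
    fixed-point weights that sum to roughly 1 << PRECISION_BITS."""
    scale = in_size / out_size
    filterscale = max(scale, 1.0)   # antialias widens the filter on downscale
    support = 1.0 * filterscale     # bilinear filter support is 1.0
    bounds, coeffs = [], []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        # triangle filter, evaluated at source-pixel centers
        w = np.array([max(1.0 - abs((x + 0.5 - center) / filterscale), 0.0)
                      for x in range(xmin, xmax)])
        w /= w.sum()                # normalize in float...
        k = np.round(w * (1 << PRECISION_BITS)).astype(np.int32)  # ...then quantize
        bounds.append((xmin, xmax))
        coeffs.append(k)
    return bounds, coeffs
```

At apply time, each output value accumulates k[x]*src[x] in int32, adds a rounding bias of 1 << (PRECISION_BITS - 1), shifts right by PRECISION_BITS, and clips to uint8 — the uint8 quantization between the two passes that the port reproduces byte-exactly.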

Correctness
- Preproc max abs error vs PIL: 4.77e-7 (fp32 ULP on the final
  /255+normalize step; the uint8 resize result is byte-identical).
- Full coco/val2017 detection parity (rfdetr-seg-nano, conf=0.4):
  26,721 / 26,721 matched at IoU>0.5, mean box IoU 1.0000,
  |Δscore| 0, 0 class-id disagreements, all matched masks
  pixel-identical.

Performance (vehicles_312px.mp4, 538 frames)
- Baseline (PIL path): 76.25 fps
- Triton fast path:    99.83 fps (+31%)
- Preproc microbench (1080p → 312²): 27.0 ms → 2.8 ms per frame (~10×)

Scope
- Gated on: single-image numpy uint8 HWC input, stretch/letterbox/
  center-crop/letterbox-reflect resize modes (all collapse to a single
  PIL stretch when dataset_version_resize_dimensions is None, verified
  via synthetic-package test), no static_crop/grayscale/contrast,
  3-channel, scaling_factor in {None, 255}, normalization set.
- Falls back to the existing PIL-based pre_process_network_input
  when any precondition fails.
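A hedged sketch of what such a gating predicate might look like (all parameter and function names here are illustrative, not the package's actual API; the real check lives inside pre_process_network_input):

```python
import numpy as np

FAST_PATH_RESIZE_MODES = {"stretch", "letterbox", "center-crop", "letterbox-reflect"}

def fast_path_eligible(image, resize_mode, static_crop, grayscale, contrast,
                       scaling_factor, normalization) -> bool:
    """Return True only when every fast-path precondition holds; any miss
    means the caller falls back to the PIL reference path."""
    if not (isinstance(image, np.ndarray) and image.dtype == np.uint8
            and image.ndim == 3 and image.shape[2] == 3):
        return False                       # single uint8 HWC 3-channel image only
    if resize_mode not in FAST_PATH_RESIZE_MODES:
        return False
    if static_crop or grayscale or contrast:
        return False                       # unsupported pre-passes
    if scaling_factor not in (None, 255):
        return False
    return normalization is not None       # normalization must be set
```

Structuring the gate as a pure predicate keeps the fallback trivially safe: the fast path is an optimization that can only be entered, never required.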

Also adds the benchmark driver
development/stream_interface/rfdetr_nano_seg_trt_workflow.py used to
measure the above numbers.

Move the Triton fast-path gate from RFDetrForInstanceSegmentationTRT into
pre_process_network_input so all six RFDetr classes (seg×{TRT,ONNX,Torch}
and detect×{TRT,ONNX,Torch}) can hit it, and widen the predicate to
accept torch uint8 HWC tensors on any device plus batched inputs
(list[ndarray], list[Tensor], 4D ndarray/Tensor — the outer function
already unbinds those to lists before the per-item check).

Color-swap parity fix: the PIL path does `image[:, :, ::-1]` whenever
`input_color_mode != network_input.color_mode`, which is True for an
unspecified caller (None). The old fast-path treated None as BGR and
skipped the swap when the network was also BGR — byte-identical to PIL
for packaged seg models but diverged from PIL on og-rfdetr-base
(ColorMode.BGR network with None caller). Align the kernel swap
condition with PIL's.
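The aligned swap rule reduces to a one-line predicate; a sketch with illustrative names (not the package's exact helper):

```python
def needs_bgr_rgb_swap(input_color_mode, network_color_mode) -> bool:
    """PIL-path rule: swap channels whenever the caller's declared color
    mode differs from the network's, with an unspecified caller (None)
    always counting as a mismatch."""
    return input_color_mode != network_color_mode
```

Under this rule a None caller against a BGR network triggers the swap (as PIL does), which is exactly the og-rfdetr-base case the old fast path got wrong.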

Integration coverage (144 tests, CUDA 13):
  baseline:  4 tests hit fast path,  6 / 160 pre_process calls
  widened : 35 tests hit fast path, 43 / 166 pre_process calls
  pass rate unchanged: 144 / 144.

Remaining ~100 tests miss on predicate categories that require kernel
extensions (static_crop, contrast, dataset_version_resize) and are
tracked as follow-up work.

Apply static_crop as a load-time offset in the kernel (plus crop-dims-based
resample tables), matching apply_static_crop_to_numpy_image's pixel-
coordinate percentage math. Extends fast-path coverage from 35 → 55 of
144 rfdetr integration tests, pass rate unchanged (144/144).
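The percentage-to-pixel math being mirrored can be sketched as follows (an illustrative helper; the real apply_static_crop_to_numpy_image may differ in rounding details):

```python
def static_crop_offsets(h: int, w: int,
                        x_min_pct: float, y_min_pct: float,
                        x_max_pct: float, y_max_pct: float):
    """Convert 0-100 crop percentages into integer pixel bounds. In the
    fused kernel these become a load-time offset plus crop-sized resample
    tables, so no cropped intermediate image is ever materialized."""
    x0 = int(w * x_min_pct / 100)
    y0 = int(h * y_min_pct / 100)
    x1 = int(w * x_max_pct / 100)
    y1 = int(h * y_max_pct / 100)
    return x0, y0, x1, y1
```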

torchvision.io.read_image returns CHW uint8 — 72 test calls in the
integration suite arrive in that layout. Mirror _tensor_to_hwc_uint8's
CHW heuristic (first dim in {1,3,4} and last dim not in {1,3,4}) and
permute to HWC before the kernel. Integration coverage 55 → 113 tests
hitting fast path, zero regressions.
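The CHW heuristic can be sketched like this (shown with NumPy for portability; the PR applies the same shape test to torch tensors via permute):

```python
import numpy as np

def ensure_hwc_uint8(a: np.ndarray) -> np.ndarray:
    """Heuristic layout fix: treat a 3-D array as CHW when its first dim
    looks like a channel count (1, 3, or 4) and its last dim does not,
    and move channels last before handing it to the kernel."""
    if a.ndim == 3 and a.shape[0] in (1, 3, 4) and a.shape[-1] not in (1, 3, 4):
        return np.ascontiguousarray(np.moveaxis(a, 0, -1))  # CHW -> HWC
    return a
```

The heuristic is ambiguous only for tiny images whose width is itself 1, 3, or 4, which do not occur in the integration suite.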

Add an environment kill-switch: INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED
(default true). Setting it to false short-circuits _fast_path_eligible so
every call falls back to the PIL reference path — useful for A/B
benchmarking and as an escape hatch if the fused kernel is ever implicated
in a regression.
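A minimal sketch of how such a kill-switch read could work (the exact set of falsy spellings accepted by the real code is an assumption):

```python
import os

def triton_preproc_enabled() -> bool:
    """Read the kill-switch: defaults to enabled, and any of the usual
    falsy spellings disables the fused kernel, forcing the PIL path."""
    value = os.getenv("INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED", "true")
    return value.strip().lower() not in {"0", "false", "no", "off"}
```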

Verified on rfdetr-seg-nano: the kernel fires on every eligible call
when env=true (5000/5000 on full coco/val2017) and never fires when
env=false (0/5000), with byte-identical predictions in both states.

e2e on vehicles_312px.mp4 (538 frames, rfdetr-seg-nano TRT):
  env=true : 99.3 fps
  env=false: 76.2 fps

All scripts are driven by INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED
(true → Triton fast path, false → PIL reference). Each run prints a
Triton kernel invocation count so it is visible from the console which
path handled each image.

Scripts:
- parity_triton_vs_pil.py — kernel-vs-PIL fp32 ULP check (20 imgs, direct
                            kernel call; bypasses the model stack)
- detection_parity_full.py — 5000-img end-to-end parity driver. Spawns
                             one subprocess per env value (so the module-
                             level env read is re-done), pickles per-image
                             detections + Triton call count, then compares.
- parity_env_var.py        — same idea at 100 imgs, a quick sanity run.
- coco_map.py              — bbox + segm mAP via pycocotools; run twice
                             with env=true/false to confirm mAP matches
                             to 4 decimals.
- preproc_microbench.py    — isolated pre_process() timing at
                             312² / 720×1280 / 1080×1920.
- _fastpath_trace.py       — shared instrumentation helper. Patches
                             _fast_path_eligible + _fast_path_preprocess +
                             triton_preprocess_rfdetr_stretch + the two
                             PIL fallbacks and prints per-surface call
                             counts at exit. Used by `python run_with_trace.py
                             <script>` for independent kill-switch audits.
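The patching idea behind _fastpath_trace.py can be sketched generically (illustrative; the real helper patches the five named surfaces and prints counts at exit):

```python
from collections import Counter
from functools import wraps

def count_calls(counter: Counter, name: str, fn):
    """Wrap a preprocessing surface so every call increments a per-surface
    counter while leaving the wrapped function's behavior unchanged."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        counter[name] += 1
        return fn(*args, **kwargs)
    return wrapper
```

Because the wrapper is behavior-preserving, the same instrumented run can serve both as a parity check and as a kill-switch audit: the counters reveal which path (Triton kernel or PIL fallback) handled each call.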