
perf: Triton pre-processing kernel#2328

Open
aseembits93 wants to merge 10 commits into
roboflow:mainfrom
aseembits93:perf/rfdetr-seg-triton-widen-scope

Conversation

aseembits93 (Contributor) commented May 12, 2026

What does this PR do?

This PR introduces a single fused Triton CUDA kernel that executes the entire image pre-processing pipeline in one launch, eliminating per-op kernel launch overhead and reducing CPU<->GPU memory transfers to the bare minimum. The kernel supports all pre-processing options EXCEPT dataset_version_resize (a two-stage resize: cv2 dataset-version resize → PIL stretch) and contrast (three distinct algorithms: histogram equalization, CLAHE, and contrast stretching, each of which would need its own pre-pass kernel).
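As a rough illustration of the math the fused kernel performs after the resize stage, here is a NumPy sketch (not the Triton code; the function name, the CHW output layout, and the omission of the resize step are my own assumptions):

```python
import numpy as np

# Standard ImageNet statistics used by the normalize step.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def fused_preprocess_reference(bgr_u8: np.ndarray) -> np.ndarray:
    """Reference for the post-resize math the fused kernel folds together:
    BGR->RGB swap, scale by 1/255, ImageNet normalization, HWC->CHW."""
    rgb = bgr_u8[:, :, ::-1].astype(np.float32)          # BGR -> RGB
    rgb = rgb / 255.0                                    # scale to [0, 1]
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD           # per-channel normalize
    return np.ascontiguousarray(rgb.transpose(2, 0, 1))  # HWC -> CHW
```

Fusing these steps into one kernel is what removes the intermediate tensors that the per-op PIL/torch chain materializes between each stage.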

Type of Change

  • New feature (non-breaking change that adds functionality)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details (make sure Triton is installed in your environment):

  • Performance gains on TensorRT video input. Run twice with
USE_TRITON_FOR_PREPROCESSING="false" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4

USE_TRITON_FOR_PREPROCESSING="true" python development/stream_interface/rfdetr_nano_seg_trt_workflow.py --video_reference vehicles_312px.mp4

vehicles_312px.mp4 (538 frames, src 312×176): T4 GPU

                             fps     ms/frame
PIL reference (env=false)    76.25   13.11
Triton fast path (env=true)  99.83   10.02
Δ                            +31%    −3.09 ms

vehicles_1080p.mp4 (538 frames, src 1920×1080 — preproc has real resize work): T4 GPU

                             fps     elapsed
PIL reference (env=false)    14.05   38.29 s
Triton fast path (env=true)  21.34   25.21 s
Δ                            +52%    −13.1 s
  • Correctness guarantees on COCO val2017. Make sure you have coco/ in the current working directory. Run python temp/detection_parity_full.py
                          Triton fast path (env=true)   PIL reference (env=false)
Triton kernel calls       5000 / 5000                   0
Detections                26,721                        26,721
Matched at IoU>0.5        26,721 (100%)
Mean box IoU              1.000000
Mean |Δscore|             0.000e+00
Class-id disagreements    0
Pixel-identical masks     26,721 / 26,721
  • Preproc microbench (isolated pre_process()) T4 GPU
USE_TRITON_FOR_PREPROCESSING="false"  python temp/preproc_microbench.py
USE_TRITON_FOR_PREPROCESSING="true" python temp/preproc_microbench.py
src → 312²    PIL (env=false)   Triton (env=true)
312×312       1.96 ms           0.29 ms (~7×)
720×1280      13.52 ms          1.63 ms (~8×)
1080×1920     28.03 ms          2.70 ms (~10×)
  • Side-by-Side visual comparison

https://drive.google.com/file/d/1aXWk0hgTsMsfUDwqF7wxqK9YrgMUPH09/view?usp=sharing

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

Replace the per-frame PIL-bilinear-antialias + to_tensor + normalize chain
in the RF-DETR TRT instance-segmentation model with a single Triton
kernel that resizes, swaps BGR↔RGB, scales by 1/255, and applies
ImageNet normalization — writing straight into the preallocated TRT
input buffer.

Byte-exact port of PIL's separable bilinear-antialias resize
(PRECISION_BITS=22, int32 fixed-point, uint8 quantization between the
horizontal and vertical passes). The horizontal uint8 intermediate
lives in registers.
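The coefficient construction being ported can be sketched in NumPy as follows. This is an approximation of PIL's precompute_coeffs for the triangle (bilinear) filter; the exact truncation and rounding details in Pillow's C source may differ slightly:

```python
import numpy as np

PRECISION_BITS = 32 - 8 - 2  # 22, matching PIL's resample fixed-point

def bilinear_coeffs(in_size: int, out_size: int):
    """Per output pixel, compute a source window [xmin, xmax) and int32
    fixed-point weights that sum to roughly 1 << PRECISION_BITS."""
    scale = in_size / out_size
    filterscale = max(scale, 1.0)   # antialias widens the filter on downscale
    support = 1.0 * filterscale     # bilinear filter support is 1.0
    bounds, coeffs = [], []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        # triangle filter, evaluated at source-pixel centers
        w = np.array([max(1.0 - abs((x + 0.5 - center) / filterscale), 0.0)
                      for x in range(xmin, xmax)])
        w /= w.sum()                # normalize in float...
        k = np.round(w * (1 << PRECISION_BITS)).astype(np.int32)  # ...then quantize
        bounds.append((xmin, xmax))
        coeffs.append(k)
    return bounds, coeffs
```

At apply time, each output value accumulates k[x]*src[x] in int32, adds a rounding bias of 1 << (PRECISION_BITS - 1), shifts right by PRECISION_BITS, and clips to uint8 — the uint8 quantization between the two passes that the port reproduces byte-exactly.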

Correctness
- Preproc max abs error vs PIL: 4.77e-7 (fp32 ULP on the final
  /255+normalize step; the uint8 resize result is byte-identical).
- Full coco/val2017 detection parity (rfdetr-seg-nano, conf=0.4):
  26,721 / 26,721 matched at IoU>0.5, mean box IoU 1.0000,
  |Δscore| 0, 0 class-id disagreements, all matched masks
  pixel-identical.

Performance (vehicles_312px.mp4, 538 frames)
- Baseline (PIL path): 76.25 fps
- Triton fast path:    99.83 fps (+31%)
- Preproc microbench (1080p → 312²): 27.0 ms → 2.8 ms per frame (~10×)

Scope
- Gated on: single-image numpy uint8 HWC input, stretch/letterbox/
  center-crop/letterbox-reflect resize modes (all collapse to a single
  PIL stretch when dataset_version_resize_dimensions is None, verified
  via synthetic-package test), no static_crop/grayscale/contrast,
  3-channel, scaling_factor in {None, 255}, normalization set.
- Falls back to the existing PIL-based pre_process_network_input
  when any precondition fails.
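A hedged sketch of what such a gating predicate might look like (all parameter and function names here are illustrative, not the package's actual API; the real check lives inside pre_process_network_input):

```python
import numpy as np

FAST_PATH_RESIZE_MODES = {"stretch", "letterbox", "center-crop", "letterbox-reflect"}

def fast_path_eligible(image, resize_mode, static_crop, grayscale, contrast,
                       scaling_factor, normalization) -> bool:
    """Return True only when every fast-path precondition holds; any miss
    means the caller falls back to the PIL reference path."""
    if not (isinstance(image, np.ndarray) and image.dtype == np.uint8
            and image.ndim == 3 and image.shape[2] == 3):
        return False                       # single uint8 HWC 3-channel image only
    if resize_mode not in FAST_PATH_RESIZE_MODES:
        return False
    if static_crop or grayscale or contrast:
        return False                       # unsupported pre-passes
    if scaling_factor not in (None, 255):
        return False
    return normalization is not None       # normalization must be set
```

Structuring the gate as a pure predicate keeps the fallback trivially safe: the fast path is an optimization that can only be entered, never required.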

Also adds the benchmark driver
development/stream_interface/rfdetr_nano_seg_trt_workflow.py used to
measure the above numbers.

Move the Triton fast-path gate from RFDetrForInstanceSegmentationTRT into
pre_process_network_input so all six RFDetr classes (seg×{TRT,ONNX,Torch}
and detect×{TRT,ONNX,Torch}) can hit it, and widen the predicate to
accept torch uint8 HWC tensors on any device plus batched inputs
(list[ndarray], list[Tensor], 4D ndarray/Tensor — the outer function
already unbinds those to lists before the per-item check).

Color-swap parity fix: the PIL path does `image[:, :, ::-1]` whenever
`input_color_mode != network_input.color_mode`, which is True for an
unspecified caller (None). The old fast-path treated None as BGR and
skipped the swap when the network was also BGR — byte-identical to PIL
for packaged seg models but diverged from PIL on og-rfdetr-base
(ColorMode.BGR network with None caller). Align the kernel swap
condition with PIL's.
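The aligned swap rule reduces to a one-line predicate; a sketch with illustrative names (not the package's exact helper):

```python
def needs_bgr_rgb_swap(input_color_mode, network_color_mode) -> bool:
    """PIL-path rule: swap channels whenever the caller's declared color
    mode differs from the network's, with an unspecified caller (None)
    always counting as a mismatch."""
    return input_color_mode != network_color_mode
```

Under this rule a None caller against a BGR network triggers the swap (as PIL does), which is exactly the og-rfdetr-base case the old fast path got wrong.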

Integration coverage (144 tests, CUDA 13):
  baseline:  4 tests hit fast path,  6 / 160 pre_process calls
  widened : 35 tests hit fast path, 43 / 166 pre_process calls
  pass rate unchanged: 144 / 144.

Remaining ~100 tests miss on predicate categories that require kernel
extensions (static_crop, contrast, dataset_version_resize) and are
tracked as follow-up work.

Apply static_crop as a load-time offset in the kernel (plus crop-dims-based
resample tables), matching apply_static_crop_to_numpy_image's pixel-
coordinate percentage math. Extends fast-path coverage from 35 → 55 of
144 rfdetr integration tests, pass rate unchanged (144/144).
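The percentage-to-pixel math being mirrored can be sketched as follows (an illustrative helper; the real apply_static_crop_to_numpy_image may differ in rounding details):

```python
def static_crop_offsets(h: int, w: int,
                        x_min_pct: float, y_min_pct: float,
                        x_max_pct: float, y_max_pct: float):
    """Convert 0-100 crop percentages into integer pixel bounds. In the
    fused kernel these become a load-time offset plus crop-sized resample
    tables, so no cropped intermediate image is ever materialized."""
    x0 = int(w * x_min_pct / 100)
    y0 = int(h * y_min_pct / 100)
    x1 = int(w * x_max_pct / 100)
    y1 = int(h * y_max_pct / 100)
    return x0, y0, x1, y1
```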

torchvision.io.read_image returns CHW uint8 — 72 test calls in the
integration suite arrive in that layout. Mirror _tensor_to_hwc_uint8's
CHW heuristic (first dim in {1,3,4} and last dim not in {1,3,4}) and
permute to HWC before the kernel. Integration coverage 55 → 113 tests
hitting fast path, zero regressions.
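The CHW heuristic can be sketched like this (shown with NumPy for portability; the PR applies the same shape test to torch tensors via permute):

```python
import numpy as np

def ensure_hwc_uint8(a: np.ndarray) -> np.ndarray:
    """Heuristic layout fix: treat a 3-D array as CHW when its first dim
    looks like a channel count (1, 3, or 4) and its last dim does not,
    and move channels last before handing it to the kernel."""
    if a.ndim == 3 and a.shape[0] in (1, 3, 4) and a.shape[-1] not in (1, 3, 4):
        return np.ascontiguousarray(np.moveaxis(a, 0, -1))  # CHW -> HWC
    return a
```

The heuristic is ambiguous only for tiny images whose width is itself 1, 3, or 4, which do not occur in the integration suite.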

Add an environment kill-switch: INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED
(default true). Setting it to false short-circuits _fast_path_eligible so
every call falls back to the PIL reference path — useful for A/B
benchmarking and as an escape hatch if the fused kernel is ever implicated
in a regression.
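A minimal sketch of how such a kill-switch read could work (the exact set of falsy spellings accepted by the real code is an assumption):

```python
import os

def triton_preproc_enabled() -> bool:
    """Read the kill-switch: defaults to enabled, and any of the usual
    falsy spellings disables the fused kernel, forcing the PIL path."""
    value = os.getenv("INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED", "true")
    return value.strip().lower() not in {"0", "false", "no", "off"}
```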

Verified on rfdetr-seg-nano: the kernel fires on every eligible call
when env=true (5000/5000 on full coco/val2017) and never fires when
env=false (0/5000), with byte-identical predictions in both states.

e2e on vehicles_312px.mp4 (538 frames, rfdetr-seg-nano TRT):
  env=true : 99.3 fps
  env=false: 76.2 fps

All scripts are driven by INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED
(true → Triton fast path, false → PIL reference). Each run prints a
Triton kernel invocation count so it is visible from the console which
path handled each image.

Scripts:
- parity_triton_vs_pil.py — kernel-vs-PIL fp32 ULP check (20 imgs, direct
                            kernel call; bypasses the model stack)
- detection_parity_full.py — 5000-img end-to-end parity driver. Spawns
                             one subprocess per env value (so the module-
                             level env read is re-done), pickles per-image
                             detections + Triton call count, then compares.
- parity_env_var.py        — same idea at 100 imgs, a quick sanity run.
- coco_map.py              — bbox + segm mAP via pycocotools; run twice
                             with env=true/false to confirm mAP matches
                             to 4 decimals.
- preproc_microbench.py    — isolated pre_process() timing at
                             312² / 720×1280 / 1080×1920.
- _fastpath_trace.py       — shared instrumentation helper. Patches
                             _fast_path_eligible + _fast_path_preprocess +
                             triton_preprocess_rfdetr_stretch + the two
                             PIL fallbacks and prints per-surface call
                             counts at exit. Used by `python run_with_trace.py
                             <script>` for independent kill-switch audits.
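The patching idea behind _fastpath_trace.py can be sketched generically (illustrative; the real helper patches the five named surfaces and prints counts at exit):

```python
from collections import Counter
from functools import wraps

def count_calls(counter: Counter, name: str, fn):
    """Wrap a preprocessing surface so every call increments a per-surface
    counter while leaving the wrapped function's behavior unchanged."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        counter[name] += 1
        return fn(*args, **kwargs)
    return wrapper
```

Because the wrapper is behavior-preserving, the same instrumented run can serve both as a parity check and as a kill-switch audit: the counters reveal which path (Triton kernel or PIL fallback) handled each call.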