perf: Triton pre-processing kernel #2328
Open
aseembits93 wants to merge 10 commits into
Conversation
Replace the per-frame PIL-bilinear-antialias + to_tensor + normalize chain
in the RF-DETR TRT instance-segmentation model with a single Triton
kernel that resizes, swaps BGR↔RGB, scales by 1/255, and applies
ImageNet normalization — writing straight into the preallocated TRT
input buffer.
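The per-pixel math the kernel fuses after the resize can be written as a short NumPy reference (a sketch, not the Triton kernel; the ImageNet constants shown are the usual values and are assumed here, the real ones come from the model's preprocessing config):

```python
import numpy as np

# Assumed ImageNet normalization constants (RGB order).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def reference_postresize(img_hwc_bgr: np.ndarray) -> np.ndarray:
    """uint8 HWC BGR -> fp32 CHW: swap to RGB, scale by 1/255, normalize."""
    rgb = img_hwc_bgr[:, :, ::-1].astype(np.float32) / 255.0
    out = (rgb - MEAN) / STD          # broadcasts over the channel axis
    return out.transpose(2, 0, 1)    # HWC -> CHW for the TRT input buffer
```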
Byte-exact port of PIL's separable bilinear-antialias resize
(PRECISION_BITS=22, int32 fixed-point, uint8 quantization between the
horizontal and vertical passes). The horizontal uint8 intermediate
lives in registers.
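The fixed-point scheme being ported can be sketched on the host side: bilinear-antialias coefficient rows are computed in float, normalized, then quantized to int32 with `PRECISION_BITS = 22`, so each output pixel becomes an int32 dot product rounded back to uint8. This is a simplified sketch of PIL's coefficient precomputation, not the kernel itself:

```python
import numpy as np

PRECISION_BITS = 32 - 8 - 2  # = 22, matching PIL's fixed-point resize

def bilinear_filter(x: float) -> float:
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0

def precompute_coeffs(in_size: int, out_size: int):
    """Per-output-pixel (start_index, int32 coefficient row), PIL-style."""
    scale = in_size / out_size
    support = max(scale, 1.0)  # antialias widens the filter when downscaling
    bounds, rows = [], []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(int(center - support + 0.5), 0)
        xmax = min(int(center + support + 0.5), in_size)
        ws = np.array([bilinear_filter((x + 0.5 - center) / max(scale, 1.0))
                       for x in range(xmin, xmax)])
        ws /= ws.sum()
        # quantize: round-to-nearest into int32 fixed point
        rows.append(np.floor(ws * (1 << PRECISION_BITS) + 0.5).astype(np.int32))
        bounds.append(xmin)
    return bounds, rows
```

Quantizing the uint8 intermediate between the horizontal and vertical passes (rather than carrying float through) is what makes the port byte-exact against PIL.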
Correctness
- Preproc max abs error vs PIL: 4.77e-7 (fp32 ULP on the final
/255+normalize step; the uint8 resize result is byte-identical).
- Full coco/val2017 detection parity (rfdetr-seg-nano, conf=0.4):
26,721 / 26,721 matched at IoU>0.5, mean box IoU 1.0000,
|Δscore| 0, 0 class-id disagreements, all matched masks
pixel-identical.
Performance (vehicles_312px.mp4, 538 frames)
- Baseline (PIL path): 76.25 fps
- Triton fast path: 99.83 fps (+31%)
- Preproc microbench (1080p → 312²): 27.0 ms → 2.8 ms per frame (~10×)
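A fair GPU microbench of this kind typically warms up first and synchronizes around the timed region so asynchronous kernel launches are not counted as already finished. A generic timing sketch (the `fn` and `sync` callables are stand-ins, not the benchmark driver's actual API):

```python
import time

def bench(fn, n_warmup=10, n_iters=100, sync=lambda: None):
    """Return mean ms per call; `sync` would be torch.cuda.synchronize
    (or similar) when timing GPU work."""
    for _ in range(n_warmup):
        fn()
    sync()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        fn()
    sync()
    return (time.perf_counter() - t0) / n_iters * 1e3
```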
Scope
- Gated on: single-image numpy uint8 HWC input, stretch/letterbox/
center-crop/letterbox-reflect resize modes (all collapse to a single
PIL stretch when dataset_version_resize_dimensions is None, verified
via synthetic-package test), no static_crop/grayscale/contrast,
3-channel, scaling_factor in {None, 255}, normalization set.
- Falls back to the existing PIL-based pre_process_network_input
when any precondition fails.
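The gating logic described above can be sketched as a predicate; all names here are illustrative (the real `_fast_path_eligible` checks the model's actual config objects):

```python
import numpy as np

# Resize modes the kernel handles (all collapse to a single PIL stretch
# when dataset_version_resize_dimensions is None).
SUPPORTED_RESIZE_MODES = {"stretch", "letterbox", "center-crop",
                          "letterbox-reflect"}

def fast_path_eligible(image, resize_mode, scaling_factor, normalization,
                       static_crop=None, grayscale=False, contrast=None):
    """Hypothetical sketch of the fast-path gate; falls back to PIL
    whenever any precondition fails."""
    if not (isinstance(image, np.ndarray) and image.dtype == np.uint8
            and image.ndim == 3 and image.shape[2] == 3):
        return False
    if resize_mode not in SUPPORTED_RESIZE_MODES:
        return False
    if scaling_factor not in (None, 255):
        return False
    if normalization is None:
        return False
    if static_crop or grayscale or contrast:
        return False
    return True
```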
Also adds the benchmark driver
development/stream_interface/rfdetr_nano_seg_trt_workflow.py used to
measure the above numbers.
Move the Triton fast-path gate from RFDetrForInstanceSegmentationTRT into
pre_process_network_input so all six RFDetr classes (seg×{TRT,ONNX,Torch}
and detect×{TRT,ONNX,Torch}) can hit it, and widen the predicate to
accept torch uint8 HWC tensors on any device plus batched inputs
(list[ndarray], list[Tensor], 4D ndarray/Tensor — the outer function
already unbinds those to lists before the per-item check).
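Accepting batched inputs in the predicate works because the outer function already normalizes them to per-item lists. A duck-typed sketch of that unbinding step (the function name is illustrative; a 4D `torch.Tensor` satisfies the same `ndim`/indexing protocol as a 4D ndarray):

```python
def unbind_to_list(batch):
    """Normalize list[ndarray], list[Tensor], or 4D ndarray/Tensor to a
    list of per-image items before the per-item eligibility check."""
    if isinstance(batch, (list, tuple)):
        return list(batch)
    if getattr(batch, "ndim", None) == 4:  # 4D ndarray or torch.Tensor
        return [batch[i] for i in range(batch.shape[0])]
    return [batch]
```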
Color-swap parity fix: the PIL path does `image[:, :, ::-1]` whenever
`input_color_mode != network_input.color_mode`, which is True for an
unspecified caller (None). The old fast-path treated None as BGR and
skipped the swap when the network was also BGR — byte-identical to PIL
for packaged seg models but diverged from PIL on og-rfdetr-base
(ColorMode.BGR network with None caller). Align the kernel swap
condition with PIL's.
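The corrected condition is simply PIL's own inequality test, under which an unspecified caller mode (None) never equals a concrete network mode and therefore always triggers the swap. A one-line sketch (argument names illustrative):

```python
def needs_channel_swap(input_color_mode, network_color_mode):
    """PIL path applies image[:, :, ::-1] whenever the modes differ;
    None (unspecified caller) compares unequal to any concrete mode,
    so it always gets the swap."""
    return input_color_mode != network_color_mode
```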
Integration coverage (144 tests, CUDA 13):
baseline: 4 tests hit fast path, 6 / 160 pre_process calls
widened : 35 tests hit fast path, 43 / 166 pre_process calls
pass rate unchanged: 144 / 144.
Remaining ~100 tests miss on predicate categories that require kernel
extensions (static_crop, contrast, dataset_version_resize) and are
tracked as follow-up work.
Apply static_crop as a load-time offset in the kernel (+ crop-dims-based
resample tables), matching apply_static_crop_to_numpy_image's
pixel-coordinate percentage math. Extends fast-path coverage from 35 → 55
of 144 rfdetr integration tests, pass rate unchanged (144/144).
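The percentage math being mirrored can be sketched as follows (a guess at the shape of the helper's computation, assuming round-to-nearest; the actual rounding in `apply_static_crop_to_numpy_image` is what the kernel must match exactly):

```python
def static_crop_bounds(height, width, x_min_pct, y_min_pct,
                       x_max_pct, y_max_pct):
    """Convert percent offsets into pixel crop bounds; the kernel then
    applies (y_min, x_min) as a load-time offset instead of
    materializing the cropped image."""
    x_min = round(width * x_min_pct / 100)
    x_max = round(width * x_max_pct / 100)
    y_min = round(height * y_min_pct / 100)
    y_max = round(height * y_max_pct / 100)
    return x_min, y_min, x_max, y_max
```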
torchvision.io.read_image returns CHW uint8 — 72 test calls in the
integration suite arrive in that layout. Mirror _tensor_to_hwc_uint8's
CHW heuristic (first dim in {1,3,4} and last dim not in {1,3,4}) and
permute to HWC before the kernel. Integration coverage 55 → 113 tests
hitting fast path, zero regressions.
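The mirrored heuristic is small enough to state directly (sketch; shown for ndarrays, where a `transpose` does what `permute` does for tensors):

```python
def looks_chw(shape):
    """Treat the array as CHW when the first dim looks like a channel
    count and the last dim does not -- _tensor_to_hwc_uint8's heuristic."""
    return shape[0] in (1, 3, 4) and shape[-1] not in (1, 3, 4)

def to_hwc(img):
    """Permute CHW -> HWC before the kernel when the heuristic fires."""
    return img.transpose(1, 2, 0) if looks_chw(img.shape) else img
```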
INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED (default true). Setting it
to false short-circuits _fast_path_eligible so every call falls back to
the PIL reference path — useful for A/B benchmarking and as an escape
hatch if the fused kernel is ever implicated in a regression.
Verified on rfdetr-seg-nano: the kernel fires on every eligible call when
env=true (5000/5000 on full coco/val2017) and never fires when env=false
(0/5000), with byte-identical predictions in both states.
e2e on vehicles_312px.mp4 (538 frames, rfdetr-seg-nano TRT):
- env=true : 99.3 fps
- env=false: 76.2 fps
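A sketch of how such a kill switch is typically read (the real read happens at module level, which is why the parity driver spawns one subprocess per env value; the function form here is illustrative):

```python
import os

def triton_preproc_enabled():
    """Default true; any of 'false'/'0'/'no' disables the fast path."""
    return os.getenv(
        "INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED", "true"
    ).lower() not in ("false", "0", "no")
```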
All scripts are driven by INFERENCE_MODELS_RFDETR_TRITON_PREPROC_ENABLED
(true → Triton fast path, false → PIL reference). Each run prints a
Triton kernel invocation count so it is visible from the console which
path handled each image.
Scripts:
- parity_triton_vs_pil.py — kernel-vs-PIL fp32 ULP check (20 imgs, direct
kernel call; bypasses the model stack)
- detection_parity_full.py — 5000-img end-to-end parity driver. Spawns
one subprocess per env value (so the module-
level env read is re-done), pickles per-image
detections + Triton call count, then compares.
- parity_env_var.py — same idea at 100 imgs, a quick sanity run.
- coco_map.py — bbox + segm mAP via pycocotools; run twice
with env=true/false to confirm mAP matches
to 4 decimals.
- preproc_microbench.py — isolated pre_process() timing at
312² / 720×1280 / 1080×1920.
- _fastpath_trace.py — shared instrumentation helper. Patches
_fast_path_eligible + _fast_path_preprocess +
triton_preprocess_rfdetr_stretch + the two
PIL fallbacks and prints per-surface call
counts at exit. Used by `python run_with_trace.py
<script>` for independent kill-switch audits.
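The patch-and-count pattern the trace helper uses can be sketched generically (names illustrative; the real helper patches the five surfaces listed above and prints the counts via an exit hook):

```python
import functools

def count_calls(module, name, counts):
    """Wrap module.<name> so each call increments counts[name] while
    delegating to the original function unchanged."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        counts[name] = counts.get(name, 0) + 1
        return original(*args, **kwargs)

    setattr(module, name, wrapper)

# e.g. counts = {}; count_calls(rfdetr_module, "_fast_path_eligible", counts)
# then atexit.register(lambda: print(counts)) to dump per-surface totals.
```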
What does this PR do?
This PR introduces a single Triton CUDA kernel that executes the entire image pre-processing chain in one launch, eliminating kernel-launch overhead and reducing CPU↔GPU memory transfers to the bare minimum. The kernel supports all options EXCEPT:
- `dataset_version_resize` (two-stage resize: cv2 dataset-version resize → PIL stretch)
- `contrast` (three distinct algorithms: histogram equalization / CLAHE / contrast stretching; each needs its own prepass kernel)
Type of Change
Testing
Test details (make sure Triton is installed in your environment):
vehicles_312px.mp4 (538 frames, src 312×176): T4 GPU
vehicles_1080p.mp4 (538 frames, src 1920×1080 — preproc has real resize work): T4 GPU
- Detection parity: requires coco/ in the current working directory; run `python temp/detection_parity_full.py`.
- Preproc microbench (isolated `pre_process()`): T4 GPU
- Results: https://drive.google.com/file/d/1aXWk0hgTsMsfUDwqF7wxqK9YrgMUPH09/view?usp=sharing
Checklist
Additional Context