Realtime stereo depth on Jetson Orin: up to 2.7× faster than the stock TensorRT pipeline with INT8, and 1.9× faster with bit-identical FP16 output.
FlashStereo is a drop-in replacement for the two-stage TensorRT inference pipeline of FoundationStereo. It keeps every intermediate tensor on the GPU between stages, lifts the per-group L2 norm and group-wise correlation (GWC) volume into hand-written CUDA kernels, and ships a calibration toolchain so you can swap each stage to INT8 at <0.2% disparity error.
The headline result on Jetson AGX Orin (TensorRT 10.3, 480×640 input):
| Configuration | Latency (p50) | Throughput | vs. stock |
|---|---|---|---|
| Stock pipeline (FP16, Python GWC, D2H/H2D bouncing) | 238 ms | 4.20 Hz | 1.00× |
| FlashStereo FP16 (GPU-resident, bit-identical ✓) | 124 ms | 8.07 Hz | 1.92× |
| FlashStereo + INT8 post | 90 ms | 11.17 Hz | 2.66× |
| FlashStereo + INT8 feat + INT8 post | 88 ms | 11.37 Hz | 2.71× |
INT8 disparity quality vs. the FP16 reference: cosine 0.99999, relative L1 0.19% — visually indistinguishable, single-pixel-scale worst case.
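The quality figures above can be approximated with a small comparison helper. This is a minimal sketch, not the repo's own test code: `compare_disparity` is a hypothetical name, and the definition of "relative L1" used here (summed absolute error divided by summed reference magnitude) is an assumption.

```python
import numpy as np

def compare_disparity(test: np.ndarray, ref: np.ndarray) -> dict:
    """Compare a test disparity map against a reference of the same shape."""
    t = test.astype(np.float64).ravel()
    r = ref.astype(np.float64).ravel()
    cosine = float(t @ r / (np.linalg.norm(t) * np.linalg.norm(r)))
    # Assumed definition: summed absolute error over summed reference magnitude.
    rel_l1 = float(np.abs(t - r).sum() / np.abs(r).sum())
    # Bit-identical means the raw bytes match, not merely that values are close.
    same_bits = bool(
        test.dtype == ref.dtype
        and test.shape == ref.shape
        and test.tobytes() == ref.tobytes()
    )
    return {
        "cosine": cosine,
        "rel_l1": rel_l1,
        "max_abs": float(np.abs(t - r).max()),
        "bit_identical": same_bits,
    }

# Synthetic stand-ins for real disparity maps (values in pixels).
rng = np.random.default_rng(0)
ref = rng.uniform(1.0, 64.0, size=(480, 640)).astype(np.float32)
noisy = ref + rng.normal(0.0, 0.1, size=ref.shape).astype(np.float32)

print(compare_disparity(ref.copy(), ref))
print(compare_disparity(noisy, ref))
```

Comparing raw bytes is stricter than `np.allclose`: it is the check that matches a "bit-identical" claim, while cosine and relative L1 describe how far a quantized result drifts.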
The stock FoundationStereo two-stage pipeline pays a ~40 ms tax per frame on a Python-mediated GWC volume builder that does the following dance:
feat_runner (TRT) → device buffers
↓
D2H to numpy (≈ 16 MiB × 2)
CPU np.linalg.norm (35 M FP32 reductions on the CPU)
H2D back to GPU
PyCUDA gwc kernel
D2H to numpy (≈ 28 MiB)
↓
post_runner (TRT)
H2D every input (≈ 70 MiB total)
↓
D2H disparity
That's roughly 100 MiB of needless host↔device traffic and a CPU-bound reduction every frame. FlashStereo rewrites the data flow as GPU-resident:
feat_runner (TRT) ── writes outputs to persistent device buffers
GPU norm kernel ── reads / writes on device
GPU GWC kernel ── reads / writes on device
post_runner (TRT) ── reads feat outputs + GWC volume directly via the
same device pointers (zero-copy chaining)
↓
D2H disparity ← the only mandatory copy per inference
Same engines. Same kernels. Same numerics. Just one host↔device round-trip per frame instead of half a dozen. The FP16 path is bit-identical to the reference, verified by a stage-by-stage diff in the style of tests/ (L1 = 0, cosine = 1.000000000 across feat outputs, GWC volume, and final disparity).
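For readers unfamiliar with the two operations that run between the engines, the following NumPy reference shows what a per-group L2 norm followed by a group-wise correlation volume computes. It is a sketch under assumptions: the `(C, H, W)` layout, the group count, the use of a mean rather than a sum, and the function names are all illustrative and may differ from the actual CUDA kernels.

```python
import numpy as np

def group_l2_normalize(feat: np.ndarray, groups: int, eps: float = 1e-6) -> np.ndarray:
    """L2-normalize each channel group at every pixel. feat has shape (C, H, W)."""
    c, h, w = feat.shape
    g = feat.reshape(groups, c // groups, h, w)
    norm = np.sqrt((g * g).sum(axis=1, keepdims=True))
    return (g / np.maximum(norm, eps)).reshape(c, h, w)

def gwc_volume(left: np.ndarray, right: np.ndarray, max_disp: int, groups: int) -> np.ndarray:
    """volume[g, d, y, x] = mean over group g of left[:, y, x] * right[:, y, x - d]."""
    c, h, w = left.shape
    l = left.reshape(groups, c // groups, h, w)
    r = right.reshape(groups, c // groups, h, w)
    vol = np.zeros((groups, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        # Columns x < d have no valid match in the right image and stay zero.
        vol[:, d, :, d:] = (l[..., d:] * r[..., : w - d]).mean(axis=1)
    return vol

# Toy check: the right features are the left features shifted by 3 pixels,
# so the correlation should peak at disparity 3 wherever a match exists.
rng = np.random.default_rng(0)
left = rng.standard_normal((32, 6, 20)).astype(np.float32)
right = np.roll(left, -3, axis=2)

vol = gwc_volume(group_l2_normalize(left, 4), group_l2_normalize(right, 4), max_disp=8, groups=4)
best = vol.mean(axis=0).argmax(axis=0)
print(vol.shape, best[0, 3:8])
```

Both steps are elementwise multiplies and small reductions over data that already lives on the GPU, which is why moving them off the CPU removes copies without changing the math.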
INT8 wins are stacked on top: post_runner INT8 alone shaves another 34 ms (the RAFT iterative refinement is dominated by quantizable convolutions); feat_runner INT8 adds a smaller gain of roughly 2 ms (90 ms → 88 ms in the table above), since it's a much smaller engine.
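The latency and throughput columns quoted above relate to each other in a simple way. Here is a minimal sketch of a p50 timing harness, assuming the callable blocks until its result is on the host (for a GPU pipeline, that means it must synchronize before returning). The `benchmark` and `speedup` helpers are hypothetical and are not the repo's scripts/bench.py.

```python
import statistics
import time

def benchmark(fn, reps: int = 20, warmup: int = 3) -> dict:
    """Time a blocking callable and report median latency and throughput."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (lazy init, caches, clocks)
    samples_ms = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    p50 = statistics.median(samples_ms)
    return {"p50_ms": p50, "hz": 1e3 / p50}

def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio of baseline latency to candidate latency."""
    return baseline_ms / candidate_ms

# Stand-in workload; in practice this would wrap a pipeline call on a fixed pair.
result = benchmark(lambda: time.sleep(0.001), reps=5, warmup=1)
print(f"p50 = {result['p50_ms']:.2f} ms, {result['hz']:.1f} Hz")

# Using the table's FP16 latencies: 238 ms stock vs. 124 ms GPU-resident.
print(f"{speedup(238, 124):.2f}x")
```

Using the median rather than the mean keeps a single slow frame (for example a clock ramp) from skewing the reported latency.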
Requires:
| Component | Requirement |
|---|---|
| GPU | Jetson AGX Orin (SM87) or any CUDA device with TensorRT 10.x |
| CUDA | 12.4+ |
| TensorRT | 10.3+ |
| Python | 3.10 / 3.11 / 3.12 |
| Other | numpy, opencv-python, pycuda, tensorrt, torch (only for ONNX export, optional at runtime) |
git clone https://github.com/saofund/flashstereo
cd flashstereo
pip install -e .
On Jetson, install PyCUDA + TensorRT Python bindings from the JetPack apt repo (the nvidia-tensorrt and python3-libnvinfer* packages); the rest is plain pip install.
You need two FoundationStereo TensorRT engines: the feature_runner and the post_runner. If you don't have them yet, see the original FoundationStereo deployment guide for the ONNX export steps, or grab a pre-built pair from your own checkpoint hub.
import cv2
from flashstereo import FlashStereoPipeline

# Both engines must be built for the same input resolution passed here.
pipe = FlashStereoPipeline(
    feat_engine_path="path/to/feature_runner.engine",
    post_engine_path="path/to/post_runner.engine",
    input_h=480, input_w=640,
)

# cv2.imread returns None if a file is missing, so check the paths first.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# The pipeline takes 3-channel images; replicate the gray channel.
left_rgb = cv2.cvtColor(left, cv2.COLOR_GRAY2RGB)
right_rgb = cv2.cvtColor(right, cv2.COLOR_GRAY2RGB)

disp = pipe.infer(left_rgb, right_rgb)  # (H, W) float32 disparity in pixels
Or as a one-shot demo:
python examples/single_pair_demo.py \
--feat-engine path/to/feature_runner.engine \
--post-engine path/to/post_runner.engine \
--left assets/example_left.png \
--right assets/example_right.png \
--out disp_color.png
A Jetson AGX Orin / TensorRT 10.3 build of the INT8 feature_runner engine and its portable calibration cache, calibrated on 16 public Middlebury 2014 Stereo pairs, is published in the HF release.
TensorRT engines are serialized binaries and are NOT portable. The
pre-built feature_runner_int8.engine loads only on:
| Requirement | Value |
|---|---|
| GPU | Jetson AGX Orin (Ampere, SM87) |
| JetPack | 6.0 |
| TensorRT | 10.3 |
| CUDA | 12.6 |
| Input shape | 480 × 640 |
For any other configuration (Thor SM110, RTX 4090/5090, Orin on
JetPack 6.1+ with TRT 10.4+, different input resolution, …), the engine
file will fail to deserialize. Use the rebuild recipe below — the
.calib file in the HF release is portable and can be reused to
skip the 5-minute INT8 calibration step on any target.
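The choice between the pre-built engine and the portable cache can be expressed as a small check. This is an illustrative sketch only: `Target`, `can_use_prebuilt`, and `artifact_to_download` are hypothetical helpers, and the check covers just the GPU architecture, TensorRT major.minor version, and input shape from the table above (it does not inspect JetPack or CUDA versions).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    sm: int        # CUDA compute capability as an integer, e.g. 87 for SM87
    tensorrt: str  # version string, "major.minor" or "major.minor.patch"
    height: int
    width: int

# Configuration the pre-built engine was serialized for (from the table above).
PREBUILT = Target(sm=87, tensorrt="10.3", height=480, width=640)

def can_use_prebuilt(t: Target) -> bool:
    """True only when every property checked here matches the pre-built engine."""
    same_trt = t.tensorrt.split(".")[:2] == PREBUILT.tensorrt.split(".")[:2]
    same_shape = (t.height, t.width) == (PREBUILT.height, PREBUILT.width)
    return t.sm == PREBUILT.sm and same_trt and same_shape

def artifact_to_download(t: Target) -> str:
    """Pick the release file: the engine if it can load, else the portable cache."""
    if can_use_prebuilt(t):
        return "engines/feature_runner_int8.engine"
    return "calib_cache/feature_runner_int8.engine.calib"

print(artifact_to_download(Target(87, "10.3.0", 480, 640)))
print(artifact_to_download(Target(87, "10.4", 480, 640)))
```

On a real system the version string could be read from `tensorrt.__version__`; the values are passed in explicitly here so the sketch runs without TensorRT installed.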
pip install huggingface_hub
# (a) Matching Orin/TRT 10.3 — drop-in usable
huggingface-cli download saofund/flashstereo-int8-orin \
engines/feature_runner_int8.engine \
--local-dir artifacts/
# (b) Any other hardware/TRT version — grab the portable .calib instead
huggingface-cli download saofund/flashstereo-int8-orin \
calib_cache/feature_runner_int8.engine.calib \
--local-dir artifacts/
# then rebuild the engine locally (see "Rebuilding INT8 engines" below),
# pointing --cache at the downloaded .calib to skip recalibration.
The matching post_runner_int8.engine is not pre-published: it takes ~45 min of GPU time to build but is trivial to regenerate from the same Middlebury pairs (see steps 3 and 4 below). See docs/HUGGINGFACE.md for the full release / upload guide.
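Reusing a calibration cache, as mentioned above, usually comes down to two callbacks. The sketch below mirrors the `read_calibration_cache` / `write_calibration_cache` method names of TensorRT's INT8 calibrator interface in a plain Python class so that it runs without TensorRT. It is an assumption about how a build script might handle the cache, not the code in scripts/build_int8.py; a real calibrator would subclass `tensorrt.IInt8EntropyCalibrator2` and also implement `get_batch_size` and `get_batch`.

```python
import os
import tempfile

class CacheBackedCalibrator:
    """Plain-Python stand-in for the cache half of a TensorRT INT8 calibrator."""

    def __init__(self, cache_path: str):
        self.cache_path = cache_path

    def read_calibration_cache(self):
        # Returning bytes tells the builder to reuse the stored scales;
        # returning None makes it fall back to running calibration batches.
        if os.path.exists(self.cache_path):
            with open(self.cache_path, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache) -> None:
        # Called once after a fresh calibration so later builds can skip it.
        with open(self.cache_path, "wb") as f:
            f.write(bytes(cache))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_runner_int8.engine.calib")
    calib = CacheBackedCalibrator(path)
    first = calib.read_calibration_cache()          # no cache yet: calibrate
    calib.write_calibration_cache(b"placeholder scales")
    second = calib.read_calibration_cache()         # cache present: reuse
    print(first, second)
```

Because the cache stores per-tensor scales rather than compiled kernels, it can be carried across devices where the serialized engine cannot.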
INT8 calibration is reproducible from any public stereo dataset. The reference recipe uses the Middlebury 2014 Stereo dataset — 23 high-quality lab scenes, free, no auth required.
# 1) Pull 16 stereo pairs from Middlebury 2014 and resize them to 480x640
python scripts/download_calib_data.py \
--out-dir assets/calib_pairs --n 16
# 2) Build INT8 feature_runner engine (~5 min on Orin)
python scripts/build_int8.py \
--onnx path/to/feature_runner.onnx \
--engine-out artifacts/feature_runner_int8.engine \
--calib-dir assets/calib_pairs
# 3) Generate post_runner calibration tensors via the FP16 feat pipeline
python scripts/gen_post_calib_data.py \
--feat-engine path/to/feature_runner.engine \
--post-engine path/to/post_runner.engine \
--calib-dir assets/calib_pairs \
--out-dir artifacts/post_calib
# 4) Build INT8 post_runner engine (~45 min on Orin)
python scripts/build_int8_post.py \
--onnx path/to/post_runner.onnx \
--engine-out artifacts/post_runner_int8.engine \
--npz-dir artifacts/post_calib
Single config:
python scripts/bench.py \
--feat artifacts/feature_runner_int8.engine \
--post artifacts/post_runner_int8.engine \
--left assets/calib_pairs/Motorcycle_left.png \
--right assets/calib_pairs/Motorcycle_right.png \
--reps 20
Full precision sweep (4 combinations of FP16/INT8 feat × FP16/INT8 post) with disparity diff against the FP16 baseline:
python scripts/bench.py \
--feat-fp16 path/to/feature_runner.engine \
--feat-int8 artifacts/feature_runner_int8.engine \
--post-fp16 path/to/post_runner.engine \
--post-int8 artifacts/post_runner_int8.engine \
--left assets/calib_pairs/Motorcycle_left.png \
--right assets/calib_pairs/Motorcycle_right.png \
--sweep
- No retraining. The model weights are identical. FlashStereo only rewires the data flow between TensorRT engines.
- No new ONNX. It uses the standard FoundationStereo two-stage ONNX exports as-is.
- No accuracy tradeoff in the FP16 path. Bit-identical disparity output, verified end-to-end.
This optimization targets the two-stage variant of FoundationStereo (separate feature_runner + post_runner + Python-mediated GWC volume). If your deployment uses the single-engine "full" variant, this codebase will not help you — profiling shows the single-engine variant spends 99.96% of its time inside the TensorRT engine itself, with no Python overhead left to remove.
FlashStereo stands on the shoulders of two pieces of work:
- FoundationStereo (NVIDIA, 2025): the stereo depth model and the original two-stage TensorRT deployment recipe.
- FlashRT (feat/orin-pipelined-streaming branch): a CUDA-native realtime inference engine for VLA models. FlashStereo borrows its core design principle: small-batch latency-sensitive inference is best served by hand-written, GPU-resident pipelines, not by general-purpose tactic-search compilers. The specific kernels are different (FlashRT optimizes transformer attention; FlashStereo optimizes a stereo cost volume) but the philosophy is shared.
If you use FlashStereo in academic work, please cite the original FoundationStereo paper:
@inproceedings{wen2025foundationstereo,
title = {FoundationStereo: Zero-Shot Stereo Matching},
author = {Wen, Bowen and Trepte, Matthew and Aribido, Joseph and Kautz, Jan and Gallo, Orazio and Birchfield, Stan},
booktitle = {CVPR},
year = {2025},
}
Apache 2.0; see LICENSE. The bundled GWC and norm CUDA kernels are released under the same license. Model weights and ONNX files are governed by the FoundationStereo license; this repo does not redistribute them.