⚡ FlashStereo

Realtime stereo depth on Jetson Orin: 1.92× faster than the stock TensorRT pipeline with bit-identical FP16 output, and up to 2.71× faster with INT8.

FlashStereo is a drop-in replacement for the two-stage TensorRT inference pipeline of FoundationStereo. It keeps every intermediate tensor on the GPU between stages, lifts the per-group L2 norm and group-wise correlation (GWC) volume into hand-written CUDA kernels, and ships a calibration toolchain so you can switch each stage to INT8 with under 0.2% relative L1 disparity error.

The headline result on Jetson AGX Orin (TensorRT 10.3, 480×640 input):

Configuration	Latency (p50)	Throughput	vs. stock
Stock pipeline (FP16, Python GWC, D2H/H2D bouncing)	238 ms	4.20 Hz	1.00×
FlashStereo FP16 (GPU-resident, bit-identical ✓)	124 ms	8.07 Hz	1.92×
FlashStereo + INT8 post	90 ms	11.17 Hz	2.66×
FlashStereo + INT8 feat + INT8 post	88 ms	11.37 Hz	2.71×

INT8 disparity quality vs. the FP16 reference: cosine 0.99999, relative L1 0.19% — visually indistinguishable, single-pixel-scale worst case.
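As a rough illustration of the two quality metrics quoted above, here is a minimal NumPy sketch. The function name and the exact metric definitions (cosine over the flattened maps, L1 normalised by the reference magnitude) are assumptions, not necessarily what the repo's scripts compute, and the input maps are synthetic stand-ins.

```python
import numpy as np

def disparity_agreement(ref: np.ndarray, test: np.ndarray) -> tuple[float, float]:
    """Return (cosine similarity, relative L1 error) of `test` against `ref`."""
    r = ref.astype(np.float64).ravel()
    t = test.astype(np.float64).ravel()
    cosine = float(np.dot(r, t) / (np.linalg.norm(r) * np.linalg.norm(t)))
    rel_l1 = float(np.abs(t - r).sum() / np.abs(r).sum())
    return cosine, rel_l1

# Synthetic stand-ins for an FP16 reference and an INT8 result
# (illustrative data, not real FlashStereo output).
rng = np.random.default_rng(0)
ref = rng.uniform(1.0, 64.0, size=(480, 640)).astype(np.float32)
test = ref + rng.normal(0.0, 0.05, size=ref.shape).astype(np.float32)

cosine, rel_l1 = disparity_agreement(ref, test)
print(f"cosine={cosine:.6f}  relative L1={100 * rel_l1:.3f}%")
```

Both numbers are global summaries, so a per-pixel maximum error is worth checking alongside them when worst-case behaviour matters.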

✨ Why it's fast

The stock FoundationStereo two-stage pipeline pays a ~40 ms tax per frame on a Python-mediated GWC volume builder that does the following dance:

  feat_runner (TRT) → device buffers
       ↓
       D2H to numpy            (≈ 16 MiB × 2)
  CPU np.linalg.norm           (35 M FP32 reductions on the CPU)
       H2D back to GPU
  PyCUDA gwc kernel
       D2H to numpy            (≈ 28 MiB)
       ↓
  post_runner (TRT)
       H2D every input         (≈ 70 MiB total)
       ↓
       D2H disparity

That's roughly 100 MiB of needless host↔device traffic and a CPU-bound reduction every frame. FlashStereo rewrites the data flow as GPU-resident:

  feat_runner (TRT)  ── writes outputs to persistent device buffers
  GPU norm kernel    ── reads / writes on device
  GPU GWC kernel     ── reads / writes on device
  post_runner (TRT)  ── reads feat outputs + GWC volume directly via the
                        same device pointers (zero-copy chaining)
       ↓
       D2H disparity   ← the only mandatory copy per inference

Same engines. Same kernels. Same numerics. Just one host↔device round-trip per frame instead of half a dozen. The FP16 path is bit-identical to the reference, verified by a tests/-style stage-by-stage diff: L1 = 0 and cosine = 1.000000000 across feat outputs, GWC volume, and final disparity.
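For readers unfamiliar with the two operations that were moved onto the GPU, the following NumPy sketch shows the usual formulation of per-group L2 normalisation followed by a group-wise correlation volume. It is a CPU reference under stated assumptions, not the FlashStereo CUDA kernels: the function name, the (C, H, W) layout, the zero fill for out-of-range columns, and the epsilon are all illustrative choices.

```python
import numpy as np

def gwc_volume_reference(left, right, max_disp, num_groups, eps=1e-12):
    """left, right: (C, H, W) feature maps -> (num_groups, max_disp, H, W) volume."""
    C, H, W = left.shape
    assert C % num_groups == 0, "channels must split evenly into groups"
    assert max_disp <= W, "disparity range cannot exceed image width"
    # Split channels into groups: (G, C // G, H, W).
    l = left.astype(np.float32).reshape(num_groups, C // num_groups, H, W)
    r = right.astype(np.float32).reshape(num_groups, C // num_groups, H, W)
    # Per-group L2 normalisation along the within-group channel axis.
    l = l / np.maximum(np.linalg.norm(l, axis=1, keepdims=True), eps)
    r = r / np.maximum(np.linalg.norm(r, axis=1, keepdims=True), eps)
    volume = np.zeros((num_groups, max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d; columns x < d stay zero.
        volume[:, d, :, d:] = (l[:, :, :, d:] * r[:, :, :, : W - d]).sum(axis=1)
    return volume

# Tiny synthetic example: identical inputs give correlation 1 at disparity 0.
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 6)).astype(np.float32)
vol = gwc_volume_reference(feat, feat, max_disp=3, num_groups=2)
print(vol.shape)
```

Because every step is an elementwise product or a short reduction over data that already lives on the device, there is no algorithmic reason for the intermediate tensors to visit the host.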

INT8 wins are stacked on top: post_runner INT8 alone shaves another 34 ms (the RAFT iterative refinement is dominated by quantizable convolutions); feat_runner INT8 adds a smaller gain of about 2 ms (90 ms to 88 ms in the table above) since it's a much smaller engine.

📦 Install

Requires:


GPU	Jetson AGX Orin (SM87) or any CUDA device with TensorRT 10.x
CUDA	12.4+
TensorRT	10.3+
Python	3.10 / 3.11 / 3.12
Other	`numpy`, `opencv-python`, `pycuda`, `tensorrt`, `torch` (only for ONNX export, optional at runtime)

git clone https://github.com/saofund/flashstereo
cd flashstereo
pip install -e .

On Jetson, install PyCUDA + TensorRT Python bindings from the JetPack apt repo (the nvidia-tensorrt and python3-libnvinfer* packages); the rest is plain pip install.

🏃 Quickstart

You need two FoundationStereo TensorRT engines: the feature_runner and the post_runner. If you don't have them yet, see the original FoundationStereo deployment guide for the ONNX export steps, or grab a pre-built pair from your own checkpoint hub.

import cv2
from flashstereo import FlashStereoPipeline

pipe = FlashStereoPipeline(
    feat_engine_path="path/to/feature_runner.engine",
    post_engine_path="path/to/post_runner.engine",
    input_h=480, input_w=640,
)

# Load each image as single-channel grayscale, then replicate it to 3 channels.
left  = cv2.imread("left.png",  cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
left_rgb  = cv2.cvtColor(left,  cv2.COLOR_GRAY2RGB)
right_rgb = cv2.cvtColor(right, cv2.COLOR_GRAY2RGB)

disp = pipe.infer(left_rgb, right_rgb)   # (H, W) float32 disparity in pixels

Or as a one-shot demo:

python examples/single_pair_demo.py \
    --feat-engine path/to/feature_runner.engine \
    --post-engine path/to/post_runner.engine \
    --left  assets/example_left.png \
    --right assets/example_right.png \
    --out   disp_color.png
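The disparity map returned by the pipeline is in pixels. If metric depth is needed, the standard rectified-stereo relation depth = focal length × baseline / disparity applies. The sketch below is a minimal example of that conversion; the function name, the focal length, and the baseline are hypothetical values for illustration and are not part of FlashStereo.

```python
import numpy as np

def disparity_to_depth(disp, focal_px, baseline_m, min_disp=1e-3):
    """Convert a disparity map (pixels) to depth (metres) for a rectified pair.

    Pixels with disparity at or below `min_disp` are marked as infinitely far.
    """
    disp = np.asarray(disp, dtype=np.float32)
    depth = np.full(disp.shape, np.inf, dtype=np.float32)
    valid = disp > min_disp
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth

# Hypothetical camera: 640 px focal length, 10 cm baseline.
disp = np.array([[32.0, 16.0], [0.0, 8.0]], dtype=np.float32)
depth = disparity_to_depth(disp, focal_px=640.0, baseline_m=0.1)
print(depth)
```

The focal length has to be expressed at the inference resolution, so if the images were resized to 480 × 640 before inference, scale the calibrated focal length by the same horizontal factor.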

📦 Pre-built INT8 weights (HuggingFace)

🤗 https://huggingface.co/saofund/flashstereo-int8-orin

A Jetson AGX Orin / TensorRT 10.3 build of the INT8 feature_runner engine + its portable calibration cache, calibrated on 16 public Middlebury 2014 Stereo pairs.

⚠️ Hardware / software compatibility for the pre-built engine

TensorRT engines are serialized binaries and are NOT portable. The pre-built feature_runner_int8.engine loads only on:

Requirement	Value
GPU	Jetson AGX Orin (Ampere, SM87)
JetPack	6.0
TensorRT	10.3
CUDA	12.6
Input shape	480 × 640

For any other configuration (Thor SM110, RTX 4090/5090, Orin on JetPack 6.1+ with TRT 10.4+, different input resolution, …), the engine file will fail to deserialize. Use the rebuild recipe below — the .calib file in the HF release is portable and can be reused to skip the 5-minute INT8 calibration step on any target.

pip install huggingface_hub

# (a) Matching Orin/TRT 10.3 — drop-in usable
huggingface-cli download saofund/flashstereo-int8-orin \
    engines/feature_runner_int8.engine \
    --local-dir artifacts/

# (b) Any other hardware/TRT version — grab the portable .calib instead
huggingface-cli download saofund/flashstereo-int8-orin \
    calib_cache/feature_runner_int8.engine.calib \
    --local-dir artifacts/
# then rebuild the engine locally (see "Rebuilding INT8 engines" below),
# pointing --cache at the downloaded .calib to skip recalibration.
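The choice between options (a) and (b) can also be made programmatically. The sketch below only compares three of the values from the compatibility table; the function name and dictionary are hypothetical, and it does not check JetPack or CUDA versions. On a real device the inputs could be read from `tensorrt.__version__` and, after initialising PyCUDA, from `pycuda.driver.Device(0).compute_capability()`; they are passed in explicitly here so the example runs without a GPU.

```python
# Values copied from the compatibility table above.
PREBUILT_TARGET = {
    "trt_major_minor": (10, 3),
    "compute_capability": (8, 7),   # SM87
    "input_hw": (480, 640),
}

def pick_artifact(trt_version: str, compute_capability: tuple, input_hw: tuple) -> str:
    """Return "engine" if the pre-built engine should load, else "calib"."""
    major_minor = tuple(int(part) for part in trt_version.split(".")[:2])
    matches = (
        major_minor == PREBUILT_TARGET["trt_major_minor"]
        and tuple(compute_capability) == PREBUILT_TARGET["compute_capability"]
        and tuple(input_hw) == PREBUILT_TARGET["input_hw"]
    )
    return "engine" if matches else "calib"

# Example version strings are illustrative.
print(pick_artifact("10.3.0.30", (8, 7), (480, 640)))
print(pick_artifact("10.7.0.23", (8, 9), (480, 640)))
```

A match here is necessary rather than sufficient: deserialization is still the final arbiter, so fall back to the .calib route if loading fails.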

The matching post_runner_int8.engine is not pre-published: it takes ~45 min of GPU time to build, but it is straightforward to regenerate from the same Middlebury pairs (see steps 3 and 4 of the rebuild recipe below). See docs/HUGGINGFACE.md for the full release / upload guide.

🔧 Rebuilding INT8 engines from scratch

INT8 calibration is reproducible from any public stereo dataset. The reference recipe uses the Middlebury 2014 Stereo dataset — 23 high-quality lab scenes, free, no auth required.

# 1) Pull 16 stereo pairs from Middlebury 2014 and resize them to 480x640
python scripts/download_calib_data.py \
    --out-dir assets/calib_pairs --n 16

# 2) Build INT8 feature_runner engine (~5 min on Orin)
python scripts/build_int8.py \
    --onnx path/to/feature_runner.onnx \
    --engine-out artifacts/feature_runner_int8.engine \
    --calib-dir assets/calib_pairs

# 3) Generate post_runner calibration tensors via the FP16 feat pipeline
python scripts/gen_post_calib_data.py \
    --feat-engine path/to/feature_runner.engine \
    --post-engine path/to/post_runner.engine \
    --calib-dir assets/calib_pairs \
    --out-dir artifacts/post_calib

# 4) Build INT8 post_runner engine (~45 min on Orin)
python scripts/build_int8_post.py \
    --onnx path/to/post_runner.onnx \
    --engine-out artifacts/post_runner_int8.engine \
    --npz-dir artifacts/post_calib
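To give some intuition for what the calibration steps above produce, the sketch below shows symmetric per-tensor INT8 quantization with a scale derived from calibration data. It uses simple max calibration purely for illustration; which calibrator the build scripts configure is not stated here, and TensorRT's calibrators may choose the range differently. All names and the random "activations" are hypothetical.

```python
import numpy as np

def quantize_int8(x, scale):
    """Map float values to INT8 codes with a symmetric scale."""
    return np.clip(np.rint(x / scale), -127, 127).astype(np.int8)

def dequantize_int8(q, scale):
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Stand-in for activations observed while running 16 calibration pairs.
rng = np.random.default_rng(0)
calib_batches = [rng.normal(0.0, 1.0, size=(4, 64)).astype(np.float32)
                 for _ in range(16)]

# Max calibration: the largest magnitude seen fixes the representable range.
amax = max(float(np.abs(batch).max()) for batch in calib_batches)
scale = amax / 127.0

x = calib_batches[0]
x_hat = dequantize_int8(quantize_int8(x, scale), scale)
max_err = float(np.abs(x - x_hat).max())
print(f"scale={scale:.5f}  max abs error={max_err:.5f}")
```

For values inside the calibrated range the round-trip error is bounded by half a quantization step, which is why representative calibration images matter: activations outside the range are clipped instead.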

📊 Bench it yourself

Single config:

python scripts/bench.py \
    --feat  artifacts/feature_runner_int8.engine \
    --post  artifacts/post_runner_int8.engine \
    --left  assets/calib_pairs/Motorcycle_left.png \
    --right assets/calib_pairs/Motorcycle_right.png \
    --reps  20

Full precision sweep (4 combinations of FP16/INT8 feat × FP16/INT8 post) with disparity diff against the FP16 baseline:

python scripts/bench.py \
    --feat-fp16 path/to/feature_runner.engine \
    --feat-int8 artifacts/feature_runner_int8.engine \
    --post-fp16 path/to/post_runner.engine \
    --post-int8 artifacts/post_runner_int8.engine \
    --left  assets/calib_pairs/Motorcycle_left.png \
    --right assets/calib_pairs/Motorcycle_right.png \
    --sweep
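For timing a configuration from your own code, a minimal sketch of how a p50 latency, throughput, and speedup can be derived from repeated runs is shown below. The helper names are hypothetical and this is not the implementation of scripts/bench.py; it assumes the timed callable is synchronous, that is, it returns only after the result is available on the host.

```python
import statistics
import time

def time_reps(fn, reps=20, warmup=3):
    """Call `fn` repeatedly and return per-call wall-clock latencies in ms."""
    for _ in range(warmup):          # warm-up runs are not recorded
        fn()
    samples_ms = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    return samples_ms

def summarize(samples_ms, baseline_p50_ms=None):
    """Return p50 latency (ms), throughput (Hz) and speedup vs. a baseline."""
    p50 = statistics.median(samples_ms)
    hz = 1e3 / p50
    speedup = baseline_p50_ms / p50 if baseline_p50_ms is not None else None
    return p50, hz, speedup

# Stand-in workload; replace the lambda with a call such as
# `lambda: pipe.infer(left_rgb, right_rgb)` on a real device.
samples = time_reps(lambda: sum(range(10_000)), reps=8)
p50, hz, _ = summarize(samples)
print(f"p50={p50:.3f} ms  throughput={hz:.1f} Hz")
```

Using the median rather than the mean keeps a single slow run, for example one hit by a clock ramp or a page fault, from distorting the reported latency.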

🧠 What it does NOT change

No retraining. The model weights are identical. FlashStereo only rewires the data flow between TensorRT engines.
No new ONNX. It uses the standard FoundationStereo two-stage ONNX exports as-is.
No accuracy tradeoff in the FP16 path. Bit-identical disparity output, verified end-to-end.

🗺️ When not to use FlashStereo

This optimization targets the two-stage variant of FoundationStereo (separate feature_runner + post_runner + Python-mediated GWC volume). If your deployment uses the single-engine "full" variant, this codebase will not help you — profiling shows the single-engine variant spends 99.96% of its time inside the TensorRT engine itself, with no Python overhead left to remove.

🙏 Credits & inspiration

FlashStereo stands on the shoulders of two pieces of work:

FoundationStereo (NVIDIA, 2025) — the stereo depth model and the original two-stage TensorRT deployment recipe.
FlashRT (feat/orin-pipelined-streaming branch) — a CUDA-native realtime inference engine for VLA models. FlashStereo borrows its core design principle: small-batch latency-sensitive inference is best served by hand-written, GPU-resident pipelines, not by general-purpose tactic-search compilers. The specific kernels are different (FlashRT optimizes transformer attention; FlashStereo optimizes a stereo cost volume) but the philosophy is shared.

If you use FlashStereo in academic work, please cite the original FoundationStereo paper:

@inproceedings{wen2025foundationstereo,
  title     = {FoundationStereo: Zero-Shot Stereo Matching},
  author    = {Wen, Bowen and Trepte, Matthew and Aribido, Joseph and Kautz, Jan and Gallo, Orazio and Birchfield, Stan},
  booktitle = {CVPR},
  year      = {2025},
}

📄 License

Apache 2.0 — see LICENSE. The bundled GWC and norm CUDA kernels are released under the same license. Model weights and ONNX files are governed by the FoundationStereo license; this repo does not redistribute them.

Benchmarks measured on Jetson AGX Orin 64 GB / JetPack 6.0 / TensorRT 10.3 / 480×640 input / 8-rep p50.
