Benchmarks
Holden Salomon edited this page Jun 13, 2026 · 3 revisions
Inference latency and throughput for winnow's two embedding models, measured on the hardware listed below. All numbers were produced with scripts/benchmark.py (included in the repo).
| Component | Details |
|---|---|
| CPU | AMD / Intel (see per-run notes) |
| GPU | NVIDIA GeForce RTX 2070 SUPER |
| Driver | 570.172.08 |
| Container CUDA | 12.8.1 |
| Host OS | Debian 13 (TrueNAS LXC) |
InsightFace pipeline: detection (RetinaFace det_10g) followed by ArcFace embedding (w600k_r50) on a 640×640 image. This is the pipeline winnow runs for every face crop it evaluates.
| Mode | Model load | Median latency | Throughput |
|---|---|---|---|
| GPU (FORCE_CPU=false) | 2.5 s | 12.8 ms | 78.3 img/s |
| CPU (FORCE_CPU=true) | 4.3 s | 102 ms | 9.8 img/s |
- Latency measured over 30 runs after 5 warmup iterations.
- Input: 640×640 synthetic image. The detection network processes the full input regardless of whether faces are found; timing is representative of real-world single-image throughput.
- At 320×320 input the timings are nearly unchanged (CPU 101 ms, GPU 13.4 ms), so at these resolutions detection runtime is dominated by fixed model overhead rather than image size.
- The GPU is about 8× faster than the CPU for InsightFace (12.8 ms vs 102 ms median latency).
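The measurement procedure described in the notes above (5 untimed warmup iterations, then 30 timed runs summarized by the median) can be sketched as follows. This is a minimal illustration, not the code in scripts/benchmark.py: `benchmark` and `fake_inference` are hypothetical names, and deriving throughput from the median latency is an assumption, since the page does not say how the script computes it.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=30):
    """Time fn() and return (median latency in ms, throughput in img/s)."""
    # Warmup iterations are executed but not timed.
    for _ in range(warmup):
        fn()

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    median_ms = statistics.median(latencies_ms)
    # Assumption: one image per call, so throughput is 1000 / latency in ms.
    return median_ms, 1000.0 / median_ms


def fake_inference():
    # Hypothetical stand-in for detection + embedding on one 640x640 image.
    time.sleep(0.002)


median_ms, img_per_s = benchmark(fake_inference)
print(f"median {median_ms:.1f} ms, {img_per_s:.1f} img/s")
```

Using the median rather than the mean keeps a single slow run (for example, a one-off allocation or a scheduler hiccup) from skewing the reported latency.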
SigLIP (google/siglip-base-patch16-224): a 224×224 Vision Transformer used for object-mode diversity selection. It supports batched inference, and the GPU benefit scales with batch size.
GPU:

| Batch | ms/batch | ms/img | img/s | p95/img |
|---|---|---|---|---|
| 1 | 13.1 | 13.13 | 76.1 | 14.2 |
| 4 | 24.4 | 6.10 | 163.9 | 6.2 |
| 8 | 45.4 | 5.67 | 176.3 | 5.7 |
| 16 | 87.7 | 5.48 | 182.4 | 5.5 |
| 32 | 171.5 | 5.36 | 186.6 | 5.4 |
CPU:

| Batch | ms/batch | ms/img | img/s | p95/img |
|---|---|---|---|---|
| 1 | 216 | 216 | 4.6 | 243 |
| 4 | 757 | 189 | 5.3 | 192 |
| 8 | 1450 | 181 | 5.5 | 202 |
| 16 | 2846 | 178 | 5.6 | 188 |
| 32 | 5683 | 178 | 5.6 | 188 |
Model load: 16.1 s (CPU; first load, no cache)
CPU batching saturates quickly: throughput barely improves past batch 4, with a ceiling of about 5.6 img/s. At batch 32 the GPU is about 33× faster than the CPU (186.6 img/s vs 5.6 img/s).
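The per-image columns and the speedup figure follow directly from the ms/batch column. As a small sketch, the arithmetic can be reproduced from the table values; the dictionary and function names here are illustrative and not taken from the benchmark script.

```python
# ms per batch, copied from the GPU and CPU tables above.
GPU_MS_PER_BATCH = {1: 13.1, 4: 24.4, 8: 45.4, 16: 87.7, 32: 171.5}
CPU_MS_PER_BATCH = {1: 216, 4: 757, 8: 1450, 16: 2846, 32: 5683}


def per_image(batch, ms_per_batch):
    """Return (ms per image, images per second) for one batch timing."""
    ms_per_img = ms_per_batch / batch
    return ms_per_img, 1000.0 / ms_per_img


for batch in GPU_MS_PER_BATCH:
    _, gpu_ips = per_image(batch, GPU_MS_PER_BATCH[batch])
    _, cpu_ips = per_image(batch, CPU_MS_PER_BATCH[batch])
    speedup = gpu_ips / cpu_ips
    print(f"batch {batch:>2}: GPU {gpu_ips:6.1f} img/s, "
          f"CPU {cpu_ips:4.1f} img/s, speedup {speedup:4.1f}x")
```

The ratio grows with batch size: at batch 1 the GPU is roughly 16× faster, and at batch 32 roughly 33×, because GPU per-image cost keeps falling with larger batches while CPU per-image cost flattens out.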
```bash
# Inside the container (GPU mode):
docker exec winnow python /app/scripts/benchmark.py

# CPU-only mode:
docker exec -e FORCE_CPU=true winnow python /app/scripts/benchmark.py

# Or directly with docker run:
docker run --rm --gpus all \
  --entrypoint /app/.venv/bin/python \
  -v /your/models:/insightface \
  -e INSIGHTFACE_HOME=/insightface \
  ghcr.io/sudolulo/winnow:latest \
  /app/scripts/benchmark.py
```
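The commands above switch between modes with the FORCE_CPU environment variable. How winnow parses that variable is not documented on this page, so the following is only a plausible sketch: `pick_device` is a hypothetical helper, and both the accepted truthy values and the GPU default are assumptions.

```python
import os


def pick_device(environ=os.environ):
    """Return "cpu" when FORCE_CPU is truthy, otherwise "cuda" (assumed default)."""
    # Assumption: unset or "false" means GPU mode, as in the plain docker exec run.
    value = environ.get("FORCE_CPU", "false").strip().lower()
    return "cpu" if value in {"1", "true", "yes"} else "cuda"


print(pick_device({"FORCE_CPU": "true"}))  # cpu
print(pick_device({}))                     # cuda
```

Passing the environment as a parameter keeps the helper easy to test without modifying the real process environment.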