perf(patchworkpp): kill per-patch heap traffic in R-VPF / R-GPF (+14.8% Hz) by LimHyungTae · Pull Request #100 · url-kaist/patchwork-plusplus

LimHyungTae · 2026-05-23T04:32:06Z

Summary

Cuts per-frame Patchwork++ time on KITTI seq 00 from 10.26 ms → 8.94 ms (97.5 Hz → 111.9 Hz, +14.8% Hz, median of 3 runs, i7-12700). Patchwork classic is unaffected (already TBB-bound, see #95).

The win comes from killing short-lived heap allocations in the per-patch hot loop, exactly the bottleneck identified in #96.

What changed

High-impact (this is where most of the 14.8% comes from; the eigh swap alone accounts for +3.2%, see Benchmarks):

PatchWorkpp::estimate_plane: drop Eigen::MatrixX3f eigen_ground + centered + centered.adjoint() * centered. Replace with a single-pass scalar accumulation of mean and 9 cross-products, then build the 3×3 cov on the stack. No more per-call Eigen heap allocations.
PatchWorkpp::extract_piecewiseground: promote src_wo_verticals and src_tmp to reused instance scratch members. vector::clear() keeps capacity, so per-patch malloc pressure on the glibc heap (which was serializing the loop, see #96, "Why Patchwork++ has no TBB parallelisation mode") drops away after the first few patches.
PatchWorkpp::estimateGround main loop: auto& zone instead of auto zone for ConcentricZoneModel_[zone_idx]. Avoids a deep-copy of the full 3-level nested vector per outer iteration. Safe: the in-place std::sort mutates patches that are read once and then flushed at the top of every estimateGround call.

Lower-impact, kept for cleanliness:

JacobiSVD<Matrix3f> → SelfAdjointEigenSolver::computeDirect for the 3×3 PSD covariance in both cpp/common/src/plane_fit.cpp and the in-place PatchWorkpp::estimate_plane. Closed-form, no Jacobi iterations. singular_values_ is repacked descending so every consumer (linearity/planarity in common, flatness_thr index (2) in patchwork classic, ground_flatness=minCoeff() and line_variable=sv(0)/sv(1) in patchwork++) keeps the same convention bit-for-bit.
const& on addCloud's add parameter, RevertCandidate loop vars, and the temporal_ground_revert / calc_point_to_plane_d / calc_mean_stdev signatures.
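The reason the eigensolver swap can preserve every downstream quantity is a standard identity for symmetric positive semi-definite matrices, written out below as a short sketch. The symbols $C$, $V$, $\Lambda$, $\lambda_i$, $\sigma_k$ are notation introduced here, not names from the code. Eigen documents `SelfAdjointEigenSolver` eigenvalues as sorted in increasing order and `JacobiSVD` singular values as sorted in decreasing order, which is why a repack is needed.

```latex
% C is the 3x3 patch covariance: symmetric, positive semi-definite.
C = V \Lambda V^{\top}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \lambda_3), \qquad
0 \le \lambda_1 \le \lambda_2 \le \lambda_3 .

% Because every \lambda_i \ge 0, the same factorisation is also an SVD
% C = U \Sigma V'^{\top} with U = V' and the order reversed:
\sigma_k = \lambda_{4-k}, \qquad k = 1, 2, 3 .

% The plane normal is the eigenvector of the smallest eigenvalue, which is
% the last singular vector, determined only up to sign:
n = \pm v_1 .
```

In exact arithmetic the two decompositions therefore agree; in floating point they agree only up to rounding, which matches the "to FP precision" wording used for the 500-patch check.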

Benchmarks

KITTI seq 00, 2900 timed frames, median of 3 runs:

| Stage | patchwork++ | patchwork (classic) |
| --- | --- | --- |
| baseline (JacobiSVD + per-call allocs) | 10.26 ms / 97.5 Hz | 4.26 ms / 234.7 Hz |
| eigh only | 9.94 ms / 100.6 Hz (+3.2%) | 4.30 ms / 232.8 Hz (noise) |
| eigh + alloc-free | 8.94 ms / 111.9 Hz (+14.8%) | 4.29 ms / 232.9 Hz (noise) |

Patchwork classic is unchanged on the perf side: TBB parallel_for already amortizes allocations across 24 cores and SVD is sub-µs/patch.
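For readers comparing the ms and Hz columns, the arithmetic can be checked with a few helper functions. The function names are hypothetical, and the results can differ from the table in the last digit because the table reports medians of measured runs rather than values recomputed from rounded numbers.

```cpp
// Frames per second from a per-frame time in milliseconds.
inline double hz_from_ms(double ms) { return 1000.0 / ms; }

// Relative throughput gain in percent: (Hz_after / Hz_before - 1) * 100.
inline double throughput_gain_pct(double before_ms, double after_ms) {
    return (before_ms / after_ms - 1.0) * 100.0;
}

// Relative per-frame time reduction in percent: (1 - t_after / t_before) * 100.
inline double latency_reduction_pct(double before_ms, double after_ms) {
    return (1.0 - after_ms / before_ms) * 100.0;
}
```

With the Patchwork++ figures, `throughput_gain_pct(10.26, 8.94)` gives roughly 14.8, while `latency_reduction_pct(10.26, 8.94)` gives roughly 12.9: the same change looks smaller when quoted as time saved per frame than as Hz gained.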

Numerical equivalence

Full KITTI seq 00 (4541 frames):

| method (protocol) | before | after | Δ |
| --- | --- | --- | --- |
| patchwork (pw) | P 92.34, R 94.64, F1 93.41 | P 92.34, R 94.64, F1 93.41 | 0.00 |
| patchwork++ (pp) | P 94.88, R 98.47, F1 96.62 | P 94.89, R 98.48, F1 96.63 | +0.01 F1 |

Algebraic identity of JacobiSVD vs eigh verified on 500 real KITTI patch covariances: normal_ (up to sign), singular_values_, linearity_, planarity_, ground_flatness, and line_variable all match to FP precision. All deltas well within the ±0.05 macro / ±0.10 per-seq budget.

Refs #96.

Test plan

pip install -v ./python/ builds clean on Linux (gcc 13)
evaluate_semantickitti.py --method patchworkpp --eval_protocol patchworkpp --seqs 00 matches baseline within ±0.05 F1
evaluate_semantickitti.py --method patchwork --eval_protocol patchwork --seqs 00 bit-identical to baseline
bench_hz.py --method patchworkpp --seq 00: +14.8% Hz, stable across 3 runs
bench_hz.py --method patchwork --seq 00: within run-to-run noise of baseline
CI matrix (Linux/macOS/Windows wheels + ros2_node + cpp_api) green

…8% Hz)

Three changes inside PatchWorkpp::extract_piecewiseground and PatchWorkpp::estimate_plane that together take KITTI seq 00 from 97.5 Hz to 111.9 Hz (median, 24-core i7-12700, 2900 timed frames):

* estimate_plane: replace MatrixX3f eigen_ground / centered / centered.adjoint() * centered with a single-pass scalar accumulation of mean and 9 cross-products, then build the 3x3 cov on the stack. No more per-call Eigen heap allocations.
* extract_piecewiseground: promote src_wo_verticals and src_tmp to reused instance scratch members. Per-patch malloc pressure on the glibc heap (which was serialising the loop, see issue #96) goes away after the first few patches because vector::clear() retains capacity.
* estimateGround main loop: `auto& zone` instead of `auto zone` for the ConcentricZoneModel_[zone_idx] read. Avoids a deep-copy of the full 3-level nested vector per outer iteration; the in-place std::sort is safe because each (zone, ring, sector) patch is read once and the CZM is flushed at the top of every estimateGround call.

Also (smaller wins, kept for cleanliness):

* JacobiSVD<Matrix3f> -> SelfAdjointEigenSolver::computeDirect on the 3x3 PSD covariance in both cpp/common/src/plane_fit.cpp and the patchworkpp in-place estimate_plane. Closed-form, no Jacobi iterations. singular_values_ is repacked descending so every consumer (linearity_ / planarity_ in common, flatness_thr index (2) in patchwork classic, ground_flatness=minCoeff() and line_variable=sv(0)/sv(1) in patchwork++) keeps the same convention bit-for-bit.
* const& on addCloud's 'add' parameter, RevertCandidate loop variables, and the temporal_ground_revert / calc_point_to_plane_d / calc_mean_stdev signatures.

Numerical equivalence verified end-to-end on KITTI seq 00:

* patchwork (pw protocol): P/R/F1 unchanged to 0.01 (bit-identical).
* patchwork++ (pp protocol): F1 96.62 -> 96.63 (delta 0.01, well within the ±0.05 macro budget).

Algebraic identity of JacobiSVD vs eigh on 500 real KITTI patch covariances confirmed to FP precision for all derived scalars (linearity_, planarity_, ground_flatness, line_variable, normal_ up to sign).

Patchwork classic is unchanged on the perf side (TBB-bound, SVD is sub-µs/patch after parallel_for); the win is concentrated in Patchwork++ where these allocations dominated the profile.

* chore(release): v1.4.1

  Patch bump for the per-patch heap-traffic refactor (#100). pypatchworkpp.patchworkpp gains 14.8% Hz (97.5 -> 111.9 Hz on KITTI seq 00, i7-12700), driven by killing short-lived vector<PointXYZ> / Eigen::Matrix allocations in R-VPF + R-GPF. Closes part of #96.

  Numerical equivalence verified end-to-end on KITTI seq 00 (4541 frames): F1 delta 0.00 for patchwork classic (bit-identical), +0.01 for patchwork++ (within the ±0.05 macro budget).

  Bumps:
  - python/pyproject.toml 1.4.0 -> 1.4.1
  - cpp/CMakeLists.txt 1.4.0 -> 1.4.1

  CHANGELOG.md updated with the full v1.4.1 entry. See #100.

* chore(release): mdformat CHANGELOG.md
* chore(release): mdformat 0.7.9 escapes

LimHyungTae merged commit 1f623d5 into master May 23, 2026
18 checks passed

LimHyungTae deleted the perf/patchworkpp-alloc-and-eigh branch May 23, 2026 05:08

LimHyungTae mentioned this pull request May 23, 2026: chore(release): v1.4.1 #101 (Merged, 4 tasks)
