perf(patchworkpp): kill per-patch heap traffic in R-VPF / R-GPF (+14.8% Hz) #100

Merged
LimHyungTae merged 1 commit into master from perf/patchworkpp-alloc-and-eigh on May 23, 2026

@LimHyungTae
Member

Summary

Cuts per-frame Patchwork++ time on KITTI seq 00 from 10.26 ms → 8.94 ms (97.5 Hz → 111.9 Hz, +14.8% Hz, median of 3 runs, i7-12700). Patchwork classic is unaffected (already TBB-bound, see #95).

The win comes from killing short-lived heap allocations in the per-patch hot loop, exactly the bottleneck identified in #96.
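As a rough illustration of the idea (not the PR's actual code), the sketch below keeps one scratch vector as a class member and clears it per patch instead of constructing a fresh vector each time. `Point3f`, `PatchProcessor`, and `filter_patch` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical point type standing in for the project's point struct.
struct Point3f {
    float x, y, z;
};

// Hypothetical per-patch worker that owns a reusable scratch buffer.
class PatchProcessor {
public:
    // Copies the points with z below z_max into the scratch buffer and
    // returns how many were kept.
    std::size_t filter_patch(const std::vector<Point3f>& patch, float z_max) {
        const std::size_t cap_before = scratch_.capacity();
        scratch_.clear();  // size becomes 0, the allocation is kept
        assert(scratch_.capacity() == cap_before);
        for (const Point3f& p : patch) {
            if (p.z < z_max) scratch_.push_back(p);
        }
        return scratch_.size();
    }

    std::size_t scratch_capacity() const { return scratch_.capacity(); }

private:
    std::vector<Point3f> scratch_;  // grows to the largest patch seen, then stays
};
```

Once the buffer has grown to the size of the largest patch, later calls only write into memory that is already allocated, so the allocator is no longer touched in the loop.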

What changed

High-impact (this is where the 14.8% comes from):

  1. PatchWorkpp::estimate_plane: drop Eigen::MatrixX3f eigen_ground + centered + centered.adjoint() * centered. Replace with a single-pass scalar accumulation of mean and 9 cross-products, then build the 3×3 cov on the stack. No more per-call Eigen heap allocations.
  2. PatchWorkpp::extract_piecewiseground: promote src_wo_verticals and src_tmp to reused instance scratch members. vector::clear() keeps capacity, so per-patch malloc pressure on the glibc heap (which was serializing the loop, per #96, "Why Patchwork++ has no TBB parallelisation mode") drops away after the first few patches.
  3. PatchWorkpp::estimateGround main loop: auto& zone instead of auto zone for ConcentricZoneModel_[zone_idx]. Avoids a deep-copy of the full 3-level nested vector per outer iteration. Safe: the in-place std::sort mutates patches that are read once and then flushed at the top of every estimateGround call.
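The single-pass accumulation in item 1 can be sketched roughly as follows. This is a minimal standalone version using only the standard library, not the code from the PR: `Point3f`, `MeanCov`, and `mean_and_cov_single_pass` are hypothetical names, and the `n - 1` divisor and the double-precision accumulators are assumptions of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical point type standing in for the project's point struct.
struct Point3f {
    float x, y, z;
};

struct MeanCov {
    std::array<float, 3> mean;                // centroid
    std::array<std::array<float, 3>, 3> cov;  // sample covariance
};

// One pass over the points: accumulate the 3 coordinate sums and the 9
// cross-product sums, then derive mean and covariance from them.
// All temporaries live on the stack; nothing is heap-allocated here.
inline MeanCov mean_and_cov_single_pass(const std::vector<Point3f>& pts) {
    assert(pts.size() >= 2 && "a sample covariance needs at least two points");

    // Assumption of this sketch: accumulate in double to limit
    // cancellation in sum(v_i * v_j) - sum(v_i) * sum(v_j) / n.
    double s[3] = {0.0, 0.0, 0.0};
    double ss[3][3] = {{0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}, {0.0, 0.0, 0.0}};
    for (const Point3f& p : pts) {
        const double v[3] = {p.x, p.y, p.z};
        for (int i = 0; i < 3; ++i) {
            s[i] += v[i];
            for (int j = 0; j < 3; ++j) ss[i][j] += v[i] * v[j];
        }
    }

    const double n = static_cast<double>(pts.size());
    MeanCov out{};
    for (int i = 0; i < 3; ++i) {
        out.mean[i] = static_cast<float>(s[i] / n);
        for (int j = 0; j < 3; ++j) {
            // Assumption: divide by n - 1 (unbiased sample covariance).
            out.cov[i][j] =
                static_cast<float>((ss[i][j] - s[i] * s[j] / n) / (n - 1.0));
        }
    }
    return out;
}
```

Compared with materialising an N×3 matrix, centering it, and multiplying it by its transpose, this reads each point once and needs no dynamically sized temporaries.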

Lower-impact, kept for cleanliness:

  1. JacobiSVD<Matrix3f> → SelfAdjointEigenSolver::computeDirect for the 3×3 PSD covariance in both cpp/common/src/plane_fit.cpp and the in-place PatchWorkpp::estimate_plane. Closed-form, no Jacobi iterations. singular_values_ is repacked descending so every consumer (linearity/planarity in common, flatness_thr index (2) in patchwork classic, ground_flatness=minCoeff() and line_variable=sv(0)/sv(1) in patchwork++) keeps the same convention bit-for-bit.
  2. const& on addCloud's add parameter, RevertCandidate loop vars, and the temporal_ground_revert / calc_point_to_plane_d / calc_mean_stdev signatures.
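To make the cost of by-value access concrete, both for the `auto zone` copy and for by-value loop variables, here is a small sketch that counts copy constructions. `Candidate`, `g_copies`, and the helper functions are hypothetical and exist only for this illustration; they are not types from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts copy constructions of Candidate (illustration only).
static std::size_t g_copies = 0;

// Hypothetical element type that owns heap memory, so each copy allocates.
struct Candidate {
    std::vector<float> payload;
    explicit Candidate(std::size_t n) : payload(n, 0.0f) {}
    Candidate(const Candidate& other) : payload(other.payload) { ++g_copies; }
    Candidate(Candidate&&) noexcept = default;
};

// Builds n candidates in place, so construction itself makes no copies.
inline std::vector<Candidate> make_candidates(std::size_t n, std::size_t payload_size) {
    std::vector<Candidate> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) v.emplace_back(payload_size);
    assert(v.size() == n);
    return v;
}

// Iterates by value: every element is copy-constructed into the loop variable.
inline std::size_t copies_when_iterating_by_value(const std::vector<Candidate>& v) {
    g_copies = 0;
    std::size_t total = 0;
    for (auto c : v) total += c.payload.size();
    (void)total;
    return g_copies;
}

// Iterates by const reference: the loop variable aliases the element.
inline std::size_t copies_when_iterating_by_const_ref(const std::vector<Candidate>& v) {
    g_copies = 0;
    std::size_t total = 0;
    for (const auto& c : v) total += c.payload.size();
    (void)total;
    return g_copies;
}
```

With five elements, the by-value loop makes five copies (each one allocating a new payload) while the by-reference loop makes none. The same reasoning applies, at a larger scale, to copying a nested vector once per outer iteration.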

Benchmarks

KITTI seq 00, 2900 timed frames, median of 3 runs:

| Stage | patchwork++ | patchwork (classic) |
| --- | --- | --- |
| baseline (JacobiSVD + per-call allocs) | 10.26 ms / 97.5 Hz | 4.26 ms / 234.7 Hz |
| eigh only | 9.94 ms / 100.6 Hz (+3.2%) | 4.30 ms / 232.8 Hz (noise) |
| eigh + alloc-free | 8.94 ms / 111.9 Hz (+14.8%) | 4.29 ms / 232.9 Hz (noise) |
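
As a quick check on how the patchwork++ figures above relate to each other, the frame rate is the reciprocal of the per-frame time, and the headline percentage is the relative change in Hz rather than in milliseconds:

$$
\begin{aligned}
f_{\text{before}} &= \frac{1000}{10.26\ \text{ms}} \approx 97.5\ \text{Hz}, \qquad
f_{\text{after}} = \frac{1000}{8.94\ \text{ms}} \approx 111.9\ \text{Hz} \\
\frac{f_{\text{after}} - f_{\text{before}}}{f_{\text{before}}} &= \frac{111.9 - 97.5}{97.5} \approx 14.8\% \\
\frac{t_{\text{before}} - t_{\text{after}}}{t_{\text{before}}} &= \frac{10.26 - 8.94}{10.26} \approx 12.9\%
\end{aligned}
$$

So the same change reads as +14.8% in throughput and about 12.9% less time per frame.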

Patchwork classic is unchanged on the perf side: TBB parallel_for already amortizes allocations across 24 cores and SVD is sub-µs/patch.

Numerical equivalence

Full KITTI seq 00 (4541 frames):

| method (protocol) | before | after | Δ |
| --- | --- | --- | --- |
| patchwork (pw) | P 92.34, R 94.64, F1 93.41 | P 92.34, R 94.64, F1 93.41 | 0.00 |
| patchwork++ (pp) | P 94.88, R 98.47, F1 96.62 | P 94.89, R 98.48, F1 96.63 | +0.01 F1 |

Algebraic identity of JacobiSVD vs eigh verified on 500 real KITTI patch covariances: normal_ (up to sign), singular_values_, linearity_, planarity_, ground_flatness, and line_variable all match to FP precision. All deltas well within the ±0.05 macro / ±0.10 per-seq budget.
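
For reference, the identity behind that check can be written out as follows. This is a sketch of the standard linear-algebra argument, not taken from the PR; the symbols are chosen here. For a symmetric positive semi-definite 3×3 covariance $C$, with eigenvalues in ascending order (the order Eigen's self-adjoint solver returns) and singular values in descending order (the order JacobiSVD returns):

$$
\begin{aligned}
C &= V\,\mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)\,V^{\top}, & 0 \le \lambda_1 \le \lambda_2 \le \lambda_3 \\
C &= U\,\mathrm{diag}(\sigma_1, \sigma_2, \sigma_3)\,U^{\top}, & \sigma_1 \ge \sigma_2 \ge \sigma_3 \ge 0 \\
\sigma_1 &= \lambda_3, \quad \sigma_2 = \lambda_2, \quad \sigma_3 = \lambda_1
\end{aligned}
$$

When the eigenvalues are distinct, the columns of $U$ are the columns of $V$ in reverse order, each determined only up to sign, which is why the normal is compared up to sign. Reversing the eigenvalues therefore reproduces the singular-value convention, so that, for example, ground_flatness $= \min_i \sigma_i = \lambda_1$ and line_variable $= \sigma_1 / \sigma_2 = \lambda_3 / \lambda_2$.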

Refs #96.

Test plan

  • pip install -v ./python/ builds clean on Linux (gcc 13)
  • evaluate_semantickitti.py --method patchworkpp --eval_protocol patchworkpp --seqs 00 matches baseline within ±0.05 F1
  • evaluate_semantickitti.py --method patchwork --eval_protocol patchwork --seqs 00 bit-identical to baseline
  • bench_hz.py --method patchworkpp --seq 00: +14.8% Hz, stable across 3 runs
  • bench_hz.py --method patchwork --seq 00: within run-to-run noise of baseline
  • CI matrix (Linux/macOS/Windows wheels + ros2_node + cpp_api) green

…8% Hz)

Three changes inside PatchWorkpp::extract_piecewiseground and
PatchWorkpp::estimate_plane that together take KITTI seq 00 from
97.5 Hz to 111.9 Hz (median, 24-core i7-12700, 2900 timed frames):

* estimate_plane: replace MatrixX3f eigen_ground / centered /
  centered.adjoint() * centered with a single-pass scalar accumulation
  of mean and 9 cross-products, then build the 3x3 cov on the stack.
  No more per-call Eigen heap allocations.
* extract_piecewiseground: promote src_wo_verticals and src_tmp to
  reused instance scratch members. Per-patch malloc pressure on the
  glibc heap (which was serialising the loop, see issue #96) goes away
  after the first few patches because vector::clear() retains capacity.
* estimateGround main loop: `auto& zone` instead of `auto zone` for the
  ConcentricZoneModel_[zone_idx] read. Avoids a deep-copy of the full
  3-level nested vector per outer iteration; the in-place std::sort is
  safe because each (zone, ring, sector) patch is read once and the
  CZM is flushed at the top of every estimateGround call.

Also (smaller wins, kept for cleanliness):
* JacobiSVD<Matrix3f> -> SelfAdjointEigenSolver::computeDirect on the
  3x3 PSD covariance in both cpp/common/src/plane_fit.cpp and the
  patchworkpp in-place estimate_plane. Closed-form, no Jacobi iterations.
  singular_values_ is repacked descending so every consumer (linearity_
  / planarity_ in common, flatness_thr index (2) in patchwork classic,
  ground_flatness=minCoeff() and line_variable=sv(0)/sv(1) in
  patchwork++) keeps the same convention bit-for-bit.
* const& on addCloud's 'add' parameter, RevertCandidate loop variables,
  and the temporal_ground_revert / calc_point_to_plane_d /
  calc_mean_stdev signatures.

Numerical equivalence verified end-to-end on KITTI seq 00:
* patchwork (pw protocol):   P/R/F1 unchanged to 0.01 (bit-identical).
* patchwork++ (pp protocol): F1 96.62 -> 96.63 (delta 0.01, well within
  the ±0.05 macro budget). Algebraic identity of JacobiSVD vs eigh on
  500 real KITTI patch covariances confirmed to FP precision for all
  derived scalars (linearity_, planarity_, ground_flatness,
  line_variable, normal_ up to sign).

Patchwork classic is unchanged on the perf side (TBB-bound, SVD is
sub-µs/patch after parallel_for); the win is concentrated in
Patchwork++ where these allocations dominated the profile.
@LimHyungTae LimHyungTae merged commit 1f623d5 into master May 23, 2026
18 checks passed
@LimHyungTae LimHyungTae deleted the perf/patchworkpp-alloc-and-eigh branch May 23, 2026 05:08
@LimHyungTae LimHyungTae mentioned this pull request May 23, 2026
4 tasks
LimHyungTae added a commit that referenced this pull request May 23, 2026
* chore(release): v1.4.1

Patch bump for the per-patch heap-traffic refactor (#100).
pypatchworkpp.patchworkpp gains 14.8% Hz (97.5 -> 111.9 Hz on KITTI
seq 00, i7-12700), driven by killing short-lived vector<PointXYZ> /
Eigen::Matrix allocations in R-VPF + R-GPF. Closes part of #96.

Numerical equivalence verified end-to-end on KITTI seq 00 (4541
frames): F1 delta 0.00 for patchwork classic (bit-identical), +0.01
for patchwork++ (within the ±0.05 macro budget).

Bumps:
- python/pyproject.toml  1.4.0 -> 1.4.1
- cpp/CMakeLists.txt     1.4.0 -> 1.4.1

CHANGELOG.md updated with the full v1.4.1 entry.

See #100.

* chore(release): mdformat CHANGELOG.md

* chore(release): mdformat 0.7.9 escapes