High-performance C++23 primitives for low-latency and real‑time systems.
This repository is structured as a small, focused "standard library replacement" for HPC-style workloads: single/multi-core pipelines, fixed-capacity queues, custom allocators, and shared‑memory IPC, all built with explicit attention to cache behavior, memory ordering, and observability.
Core principles:
- Latency first: Prefer bounded queues, pre-allocated memory, and zero-copy data paths over generality. All hot paths are `noexcept` and branch-light.
- Predictable memory: Arena and pool allocators avoid `malloc`/`free` variability and keep data contiguous in L1.
- Cache coherency aware: Cache-line padding, power-of-two ring sizes, and TTAS spinlocks minimize false sharing and unnecessary cache-line invalidations.
- Explicit memory ordering: Atomics use the weakest ordering that is still correct (`memory_order_relaxed` vs. `acquire`/`release`), with comments explaining the happens-before relationships.
- Reproducible performance: Benchmarks pin threads to cores, warm up before measurement, and report comparable numbers vs. standard library baselines.
- Production-oriented layout: Modern CMake, a clear include hierarchy, and separable `tests/` and `benchmarks/` mirror a real library used across many services.
Repository skeleton:
    cpp-hpc-primitives/
    ├── benchmarks/          # Google Benchmark microbenchmarks
    ├── cmake/               # Toolchain & dependency modules
    ├── include/
    │   └── hpc/
    │       ├── core/        # Concurrency and allocation primitives
    │       ├── ipc/         # Shared-memory IPC structures
    │       └── support/     # Timing, CPU topology, cache-line helpers
    ├── src/                 # Non-header-only implementations
    ├── tests/               # GoogleTest unit and stress tests
    ├── CMakeLists.txt       # Modern CMake + FetchContent
    └── README.md            # This document
Type: hpc::core::spsc_ring_buffer<T>
A single‑producer, single‑consumer ring buffer that matches the needs of a feed‑handler → strategy or strategy → gateway pipeline.
Key design points:
- False sharing prevention: Producer and consumer indices are separated by cache-line padding (`alignas(64)`), so updating the head does not invalidate the cache line containing the tail (and vice versa).
- Power-of-two capacity: Capacity is forced to a power of two, so index wrap-around uses a bit mask (`index & (capacity - 1)`) instead of modulo division.
- Batch API: `push_batch(T* items, size_t count)` (and a corresponding batch pop) amortizes the cost of the store-side barrier and bounds checks across many elements.
- Zero-copy slots: `T* alloc_slot()` and `commit_slot()` allow the producer to construct messages in place inside the ring, avoiding an extra copy and improving cache locality for complex message types.
- Memory ordering: Atomics use `memory_order_acquire`/`memory_order_release` across producer/consumer boundaries, with `memory_order_relaxed` for local progress where safe. Comments in the implementation document the required happens-before relationships per operation.
Types: hpc::core::arena, hpc::core::arena_allocator<T>
A pointer‑bump allocator used for short‑lived, bulk‑freed objects (e.g., per‑snapshot order books or per‑batch message decoding).
Characteristics:
- O(1) allocation: Linear bump of a single pointer; no locks, no free list.
- Noexcept: Allocation does not throw; failure is returned explicitly and can be handled on a slow path.
- Bulk reset: The arena can be reset in O(1), reclaiming the entire region at once when a batch or snapshot is complete.
- Cache‑friendly: Objects allocated together are contiguous in memory, improving hardware prefetch efficiency and TLB locality.
Types: hpc::core::fixed_pool, hpc::core::pool_allocator<T>
A free-list based allocator for fixed-size objects such as `Order`, `Quote`, `OrderBookNode`, etc.
Characteristics:
- Deterministic allocation: Fixed block size, constant‑time allocate/free.
- Locality: The pool is backed by contiguous slabs, so active objects tend to live in a small number of cache lines.
- Low fragmentation: No general heap metadata or per‑allocation headers; the free list is embedded into freed blocks.
Type: hpc::core::ttas_spinlock
A test‑test‑and‑set spinlock with exponential backoff.
Why TTAS:
- Reduced cache line invalidation: Threads first perform a plain read of the lock variable. Only when it looks free do they attempt an atomic exchange. This avoids thrashing the cache line in the common contended case.
- Backoff strategy: On failure, the lock spins in a short `_mm_pause()` loop (hyper-threading friendly) before optionally yielding to the OS.
Appropriate uses:
- Very short critical sections on hot paths where `std::mutex` would introduce unacceptable scheduling latency.
- Coarse-grained configuration or statistics updates where you want to avoid the complexity of full lock-free structures.
Type: hpc::ipc::shm_spsc_ring_buffer<T>
An SPSC ring buffer hosted in a POSIX shared memory segment (`shm_open`/`mmap`), allowing two processes to communicate with:
- zero per-message dynamic allocation,
- a fixed, bounded buffer,
- and no serialization overhead beyond the in-memory representation of `T`.
Typical uses:
- C++ feed handler → Python/PyTorch process for model inference.
- C++ matching engine → monitoring/analytics process.
The layout is intentionally simple and documented so that non‑C++ consumers can attach to the shared memory region and parse messages using only a struct definition.
Microbenchmarks are implemented with Google Benchmark in benchmarks/. Benchmarks
pin threads to cores and run in steady‑state to reduce OS jitter and warm‑up
artifacts.
All numbers below are from a local development machine; they are intended to show relative speedups rather than absolute limits. Exact hardware and compiler flags are recorded in the benchmark output.
| Component | Operations/sec | Wall time (ns) | CPU time (ns) | Baseline |
|---|---|---|---|---|
| `std::queue` (SPSC baseline) | 722.06 M/s | 1419 | 1418 | – |
| `hpc::SPSCQueue` | 1.05453 G/s | 971 | 971 | 1.46x vs `std::queue` |
| `malloc` | 96.24 M/s | 10641 | 10640 | – |
| `hpc::arena_allocator` | 441.39 M/s | 2320 | 2320 | 4.59x vs `malloc` |
| `hpc::pool_allocator` | 671.70 M/s | 1525 | 1524 | 6.98x vs `malloc` |
| `std::mutex` (contended) | 1.63 G/s | 1456410* | 40318* | – |
| `hpc::TTASSpinLock` (contended) | 2.58 G/s | 250749* | 25410* | 1.59x vs `std::mutex` |
| `std::queue` + `std::mutex` (MPMC) | 21.56 G/s | 59646021** | 48640** | – |
| `hpc::MPMCQueue` | 5.35 G/s | 90331722** | 195950** | 0.25x vs `std::queue`+`std::mutex` |
| `malloc` (1 MiB) | – | 13.7*** | 13.7*** | – |
| `new[]` (1 MiB) | – | 15.5*** | 15.5*** | – |
| `huge_page_alloc` (1 MiB) | – | 1.12*** | 1.12*** | ~12x faster vs `malloc`/`new[]` |
* Spinlock vs std::mutex numbers are from a contended critical-section benchmark
(benchmarks/bench_spinlock.cpp) with multiple threads incrementing a shared
counter; the absolute nanoseconds include benchmark harness overhead and should
be interpreted relatively. Wall time reflects end-to-end duration including
blocking/scheduling; CPU time reflects active CPU work per iteration.
** MPMC throughput numbers are from a 2‑producer / 2‑consumer benchmark pushing
1,048,576 items per producer (benchmarks/bench_mpmc_ring_buffer.cpp). In this
particular microbenchmark, std::queue protected by a single std::mutex is
faster than the lock‑free hpc::core::mpmc_ring_buffer. This is expected:
- The critical section around `std::queue` is extremely small (push/pop on a pre-allocated container with no dynamic allocation in the hot path), so `std::mutex` can remain very efficient.
- The lock-free MPMC queue pays additional per-slot overhead (sequence counters, extra atomics, CAS loops) to provide multi-producer/multi-consumer semantics without a central lock and to avoid classic ABA issues.
- Under light contention with few threads and a tiny critical section, the cost of those extra atomics can outweigh the cost of taking an uncontended or mildly contended `std::mutex`.
*** BM_Malloc_Free, BM_New_Delete, and BM_HugePageAlloc_Free results are
from benchmarks/bench_numa_hugepages.cpp with 1 MiB and 16 MiB allocation
sizes. The table reports representative 1 MiB numbers; huge page allocation is
shown here as a best-case microbenchmark and actual benefit depends on OS
configuration (Linux MAP_HUGETLB, Windows MEM_LARGE_PAGES) and privileges.
The MPMC implementation in this repository is tuned for scalability and
predictable behavior under higher contention and more complex pipelines, not
for winning every synthetic microbenchmark against a single std::mutex.
These numbers are not synthetic micro-optimizations for their own sake; the properties they measure map directly to real workloads:
- Higher queue throughput at lower per‑operation latency → more headroom for decode, normalization, and strategy logic per tick.
- Faster, more predictable allocators → lower tail latency when creating or recycling orders, quotes, and internal book structures.
- Cheaper spinlock acquisitions under contention → less time stuck in critical sections compared to `std::mutex`, especially for short, hot critical paths.
From the project root:

    mkdir -p build
    cd build
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DHPC_ENABLE_TESTS=ON \
          -DHPC_ENABLE_BENCHMARKS=ON ..
    cmake --build . -j

Run unit tests:

    ctest --output-on-failure

Run benchmarks:

    ./benchmarks/hpc_benchmarks

On Linux, you can additionally pin benchmark processes to specific cores (via `taskset`) or use the built-in CPU affinity utilities in this repo to reduce scheduler noise.
- Language standard: C++23.
- Build system: Modern CMake with `FetchContent` for GoogleTest and Google Benchmark.
- Tests: Unit tests under `tests/` validate correctness of the ring buffer, allocators, and spinlock.
- Benchmarks: Microbenchmarks under `benchmarks/` compare these primitives against standard library equivalents (`std::queue`, `malloc`/`new`, etc.).
From a clean checkout:

    mkdir -p build
    cd build
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DHPC_ENABLE_TESTS=ON \
          -DHPC_ENABLE_BENCHMARKS=ON ..
    cmake --build . -j
    ctest --output-on-failure
    ./benchmarks/hpc_benchmarks

To illustrate the "C++ → Python/GPU" story, this repo includes a concrete shared-memory example:
- `examples/shm_publisher.cpp`: C++ process that creates a `hpc::ipc::shm_spsc_ring_buffer<Message>` in a POSIX shared memory region and publishes fixed-size `Message` structs.
- `examples/shm_subscriber.py`: Python process that attaches to the same POSIX shared memory object (via `posix_ipc.SharedMemory("/hpc_shm_spsc_ring")`) and decodes messages with `struct.Struct("<QQ48s")`. On Linux this object also appears under `/dev/shm`, but on macOS it is only visible via the POSIX shared-memory APIs.
- `docs/shm_ipc_design.md`: Design notes for multi-subscriber shared-memory rings, backpressure policies, and how to feed data into PyTorch models.
The `Message` layout is intentionally simple and stable so that non-C++ consumers (Python, CUDA kernels) can treat the shared memory region as a read-only array of records without any C++ ABI dependencies.
To build the example publisher:

    cmake -S . -B build \
          -DCMAKE_BUILD_TYPE=Release \
          -DHPC_ENABLE_TESTS=ON \
          -DHPC_ENABLE_BENCHMARKS=ON \
          -DHPC_ENABLE_EXAMPLES=ON
    cmake --build build -j

Run the publisher (C++):

    ./build/hpc_shm_publisher

Then in a separate shell, run the Python subscriber:

    python3 examples/shm_subscriber.py