LFQueue

A lock-free ring buffer inspired by rte_ring.

Design & Inspiration

This project is inspired by the DPDK rte_ring architecture and focuses on building a high-performance lock-free ring buffer for low-latency systems workloads.

The primary goal is understanding real inter-core synchronization behavior under contention rather than only maximizing synthetic throughput numbers.

Current design goals:

  • Fixed-size FIFO ring buffer.
  • Lock-free synchronization for low-latency communication.
  • Specialized SPSC and MPMC implementations.
  • Compile-time configurable batching.
  • Bulk enqueue/dequeue APIs.
  • Local caching of remote synchronization indices.
  • Explicit cacheline alignment and false-sharing reduction.
  • Publication-frequency tuning.
  • SIMD-assisted payload movement for larger payloads.

Status: SPSC paths are heavily optimized and benchmarked. MPMC paths exist but still require deeper scalability and stress validation.


Queue Architecture

The implementation was later refactored into four specialized queue classes:

Class               Purpose
SPSCQueue<T>        Low-latency SPSC queue
SPSCQueueBatch<T>   Throughput-oriented batched SPSC queue
MPMCQueue<T>        Multi-producer multi-consumer queue
MPMCQueueBatch<T>   Batched MPMC queue

This separation avoids:

  • leaking batching logic into latency paths
  • leaking MPMC synchronization into SPSC hot paths
  • unnecessary branching and template bloat
  • instruction-cache pollution

At high optimization levels, simplifying generated machine code became more important than keeping a unified abstraction.
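
For reference, basic single-producer/single-consumer usage might look like the sketch below; the constructor and the enqueue/dequeue signatures are assumptions for illustration, not the project's exact API.

#include <cstdint>
#include <thread>
// #include "SPSCQueue.hpp"   // assumed header name

struct Order { uint64_t id; uint64_t price; };

int main()
{
    SPSCQueue<Order> queue(65536);          // capacity is a power of two

    std::thread producer([&] {
        Order o{1, 100};
        while (!queue.enqueue(o)) { }       // spin until a slot is free
    });

    std::thread consumer([&] {
        Order o;
        while (!queue.dequeue(o)) { }       // spin until an item arrives
    });

    producer.join();
    consumer.join();
}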


Key Optimizations

Cacheline Separation

Producer and consumer synchronization variables are isolated:

alignas(64) std::atomic<uint32_t> prod_tail_;
alignas(64) std::atomic<uint32_t> cons_tail_;

This reduces:

  • MESI invalidation traffic
  • cacheline bouncing
  • false sharing

Power-of-Two Masking

Index wrapping uses:

index & (capacity - 1)

instead of modulo operations to avoid division instructions in hot paths.
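
A minimal sketch of the masking trick, assuming the capacity is validated elsewhere as a power of two:

#include <cassert>
#include <cstdint>

// With capacity = 2^k, (index & (capacity - 1)) equals (index % capacity)
// but compiles to a single AND instead of a division.
inline uint32_t wrap(uint32_t index, uint32_t capacity)
{
    assert((capacity & (capacity - 1)) == 0);   // power of two only
    return index & (capacity - 1);
}
// e.g. wrap(65537, 65536) == 1, wrap(3, 4) == 3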


Cached Remote Indices

The producer and consumer each keep a locally cached copy of the other side's synchronization index.

This reduces:

  • acquire-load frequency
  • coherence traffic
  • shared cacheline reloads

This optimization became increasingly important once the producer and consumer were pinned to physically separate cores.
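
The two optimizations above combine naturally in the producer's enqueue path. The sketch below is illustrative only; the field names, memory orders, and layout are assumptions rather than the project's actual implementation.

#include <atomic>
#include <cstdint>

template <typename T>
struct SpscRingSketch
{
    // Producer-owned cacheline: its own tail plus a cached copy of the consumer index.
    alignas(64) std::atomic<uint32_t> prod_tail_{0};
    uint32_t cons_tail_cache_ = 0;

    // Consumer-owned cacheline: its own tail plus a cached copy of the producer index.
    alignas(64) std::atomic<uint32_t> cons_tail_{0};
    uint32_t prod_tail_cache_ = 0;

    alignas(64) T* buf_  = nullptr;
    uint32_t       mask_ = 0;                   // capacity - 1

    bool enqueue(const T& item)
    {
        uint32_t tail = prod_tail_.load(std::memory_order_relaxed);
        // Only re-read the consumer index (a cross-core acquire load) when the
        // cached value makes the ring look full.
        if (tail - cons_tail_cache_ > mask_)
        {
            cons_tail_cache_ = cons_tail_.load(std::memory_order_acquire);
            if (tail - cons_tail_cache_ > mask_)
                return false;                   // genuinely full
        }
        buf_[tail & mask_] = item;
        prod_tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
};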


SIMD Payload Copy Paths

Specialized AVX2 paths exist for 32-byte and 64-byte payloads:

// Requires <immintrin.h>; src and dst are illustrative pointers to the payload and slot.
if constexpr (sizeof(T) == 32)
{
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), v);
}

This reduces:

  • scalar copy instructions
  • dependency chains
  • payload movement overhead

Performance impact depends heavily on:

  • compiler backend quality
  • payload alignment
  • microarchitecture behavior

Benchmark Results

Throughput & Latency Benchmarks

The following results represent averages across 10 independent benchmark runs.

Threads were pinned to physically separate cores (2 -> 4) to expose realistic inter-core synchronization costs.


Pointer Benchmark

Implementation   Throughput (M ops/sec)
LFQueue          121.0
Drogalis         106.2
Rigtorp           69.5
BoostSPSC         61.6

Payload2 (2 bytes)

Implementation   Throughput (M ops/sec)   Avg Latency   P50   P99
LFQueue           80.6                    502           528   556
Rigtorp           56.6                    493           508   542
Drogalis          60.7                    497           512   548
BoostSPSC         52.0                    490           526   569

Payload4 (4 bytes)

Implementation   Throughput (M ops/sec)   Avg Latency   P50   P99
LFQueue          115.9                    488           520   539
Rigtorp           68.4                    493           504   537
Drogalis          92.9                    495           501   538
BoostSPSC         73.8                    489           517   569

Payload8 (8 bytes)

Implementation   Throughput (M ops/sec)   Avg Latency   P50   P99
LFQueue          110.1                    499           527   549
Rigtorp           76.3                    493           506   537
Drogalis         106.9                    495           503   539
BoostSPSC         58.6                    489           526   567

Payload16 (16 bytes)

Implementation   Throughput (M ops/sec)   Avg Latency   P50   P99
Drogalis          81.0                    499           518   546
LFQueue           70.3                    487           525   546
Rigtorp           65.9                    495           516   541
BoostSPSC         50.3                    492           521   580

Payload32 (32 bytes)

Implementation   Throughput (M ops/sec)   Avg Latency   P50   P99
LFQueue           72.6                    380           368   446
Rigtorp           50.8                    375           367   428
Drogalis          50.5                    380           368   431
BoostSPSC         32.7                    352           357   409

Payload64 (64 bytes)

Implementation   Throughput (M ops/sec)   Avg Latency   P50   P99
LFQueue           45.3                    404           390   492
Drogalis          36.0                    397           388   474
Rigtorp           35.6                    396           389   475
BoostSPSC         16.1                    381           388   502

Batch Throughput & Latency

Benchmark                 Result
LFQueueBatch Throughput   412.6 M ops/sec
Avg Batch Latency         661 cycles
P50 Batch Latency         640 cycles
P99 Batch Latency         822 cycles

Burst Sweep Results

Burst Size   Throughput (M ops/sec)
 1            25.8
 2            52.2
 4           109.4
 8           165.4
16           324.0
32           442.2
64           582.3

Observation

Burst throughput scales aggressively as synchronization overhead becomes amortized across larger batches.

This demonstrates that:

  • atomic publication overhead dominates small bursts
  • cacheline reuse improves significantly at larger bursts
  • synchronization amortization becomes critical at high throughput levels

Occupancy Sweep Results

Queue Occupancy   Throughput (M ops/sec)
25%               143.0
50%               119.0
75%               131.0
95%               127.5

Observation

The queue maintains relatively stable throughput even under high occupancy levels, indicating good behavior near saturation and reduced degradation under backpressure.


Publication Sweep Results

Publish Interval   Throughput (M ops/sec)
 1                 561.7
 2                 555.8
 4                 551.9
 8                 554.0
16                 558.8
32                 559.6
64                 681.2

Observation

Less frequent publication significantly reduces synchronization pressure and cache coherency traffic.

Publishing every 64 items produced the highest throughput by amortizing the following costs (sketched in code after this list):

  • acquire/release synchronization
  • atomic store overhead
  • cacheline ownership transfers
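
A minimal sketch of this amortized publication pattern; the variable names and the flush step are illustrative, not the project's exact code.

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> prod_tail_{0};        // shared index the consumer reads
uint32_t local_tail_  = 0;                  // producer-private running index
uint32_t unpublished_ = 0;
constexpr uint32_t kPublishInterval = 64;

void on_push()
{
    ++local_tail_;
    if (++unpublished_ == kPublishInterval)
    {
        // One release store makes the whole group visible at once, amortizing
        // the atomic store and the cacheline ownership transfer.
        prod_tail_.store(local_tail_, std::memory_order_release);
        unpublished_ = 0;
    }
}

void flush()                                // publish any remainder at the end
{
    if (unpublished_ != 0)
    {
        prod_tail_.store(local_tail_, std::memory_order_release);
        unpublished_ = 0;
    }
}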

Benchmark Methodology

The benchmark suite measures:

  • hot-path throughput
  • synchronization overhead
  • cache coherency pressure
  • inter-core communication behavior
  • batching efficiency
  • publication-frequency tradeoffs
  • latency stability

The harness includes:

  • warmup phases
  • thread pinning
  • cyclic payload pools
  • repeated mean/median analysis
  • burst sweeps
  • occupancy sweeps
  • publication-frequency sweeps

Threads are pinned to physically separate cores to expose realistic synchronization costs.
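
For reproducibility, pinning a thread to a core on Linux typically looks like the following sketch (the harness's actual pinning code may differ):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core so it cannot migrate mid-benchmark.
void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
// e.g. pin_to_core(2) in the producer thread and pin_to_core(4) in the consumer thread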


Benchmark Configuration

Setting          Value
Hardware         HP Laptop 15s
CPU              AMD Ryzen 5
Memory           16GB RAM
Operations       50,000,000
Queue Capacity   65,536
Build Flags      -O3 -march=native -mavx2 -ftree-vectorize -fno-tree-slp-vectorize -flto
Benchmark Runs   10 averaged runs
CPU Governor     performance
Thread Pinning   Enabled
Warmup Phase     Enabled

CPU Isolation & Benchmark Environment

The following configuration was used during benchmarking.

Set the CPU frequency governor to performance:

sudo cpupower frequency-set -g performance

Edit GRUB:

sudo nano /etc/default/grub

Replace:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

with:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,4 nohz_full=2,4 rcu_nocbs=2,4"

Apply:

sudo update-grub
sudo reboot

Disable boost:

echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

Disable SMT:

cat /sys/devices/system/cpu/smt/control
echo off | sudo tee /sys/devices/system/cpu/smt/control

This reduces:

  • scheduler noise
  • SMT sibling interference
  • frequency oscillation
  • kernel tick interruptions

What The Benchmarks Actually Measure

Throughput Benchmarks

Measures:

  • enqueue/dequeue hot-path cost
  • synchronization overhead
  • cache coherency traffic
  • payload movement efficiency

This is primarily:

  • a synchronization benchmark
  • a cache coherency benchmark
  • a compiler code-generation benchmark

not a pure memory-bandwidth benchmark.


Latency Benchmarks

Measures:

  • inter-core roundtrip latency
  • publication visibility delay
  • tail latency stability

Metrics:

  • average latency
  • p50 latency
  • p99 latency

Latency is measured using rdtscp.
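
A typical rdtscp-based timestamp helper looks roughly like this (illustrative; the harness's exact measurement code may differ):

#include <cstdint>
#include <x86intrin.h>

// rdtscp waits for prior instructions to complete before reading the TSC,
// which makes it less prone to reordering than plain rdtsc.
inline uint64_t timestamp()
{
    unsigned int aux;                       // receives the core/node id from IA32_TSC_AUX
    return __rdtscp(&aux);
}

// uint64_t start = timestamp();
// queue.enqueue(item);
// uint64_t latency_cycles = timestamp() - start;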


Burst Sweeps

Burst sizes:

  • 1
  • 2
  • 4
  • 8
  • 16
  • 32
  • 64

Measures:

  • synchronization amortization
  • batching scalability
  • cacheline reuse efficiency

Small bursts:

  • lower latency
  • higher synchronization frequency

Large bursts:

  • higher throughput
  • reduced atomic overhead

Publication Sweeps

Varies producer publication frequency.

Measures:

  • acquire/release overhead
  • cache coherency traffic
  • synchronization frequency cost

Publishing less frequently:

  • improves throughput
  • reduces coherence traffic

but increases visibility delay.


Occupancy Sweeps

Queue is prefilled to:

  • 25%
  • 50%
  • 75%
  • 95%

Measures:

  • backpressure behavior
  • near-full contention
  • saturation degradation

Many queues benchmark well when empty but degrade near capacity.


Important Observations

SMT Sibling Measurements Were Misleading

Earlier benchmarks used SMT siblings:

0 -> 1

which shared:

  • L1 cache
  • store buffers
  • execution resources

This produced artificially optimistic results.

Current benchmarks use physically separate cores:

0 -> 2

to expose real inter-core synchronization costs.


Benchmark Harnesses Can Become Bottlenecks

Earlier benchmark versions accidentally measured:

  • payload construction
  • allocation overhead
  • memory initialization

instead of queue performance itself.

This was fixed using the following (a minimal payload-pool sketch follows this list):

  • cyclic payload pools
  • warmup phases
  • pre-generated payload buffers
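
A cyclic payload pool in this spirit can be as small as the sketch below; the class name and pool size are illustrative assumptions, not the harness's actual code.

#include <array>
#include <cstddef>

// Payloads are generated once before the timed region; the hot loop only
// cycles an index, so construction and allocation stay out of the measurement.
template <typename T, std::size_t N = 4096>
struct PayloadPool
{
    std::array<T, N> items{};               // pre-generated before benchmarking
    std::size_t next = 0;

    const T& get()
    {
        const T& p = items[next];
        next = (next + 1) % N;              // N can be a power of two to mask instead
        return p;
    }
};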

CPU Isolation Misconfiguration Can Destroy Performance

Incorrect isolcpus / nohz_full usage caused isolated CPUs to enter deep idle states.

This collapsed throughput from roughly 300M ops/sec to below 100M ops/sec, despite relatively low instruction counts.

This reinforced a major lesson:

Benchmark environment quality matters as much as algorithm quality.


Current Optimizations

Implemented optimizations include:

  • Cached synchronization variables
  • Reduced synchronization frequency
  • Compile-time batching
  • Bulk enqueue/dequeue APIs
  • Cacheline-separated synchronization state
  • Branch prediction hints (see the sketch after this list)
  • always_inline hot paths
  • Power-of-two masking
  • SIMD-assisted payload copies
  • Publication amortization
  • Payload pre-generation pools
  • Warmup-aware benchmarking
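
Branch hints and forced inlining are typically spelled with GCC/Clang builtins and attributes; the macro names below are illustrative, not necessarily the ones used in this codebase.

// Illustrative GCC/Clang-style hints; the macro names are assumptions.
#define LIKELY(x)       __builtin_expect(!!(x), 1)
#define UNLIKELY(x)     __builtin_expect(!!(x), 0)
#define ALWAYS_INLINE   __attribute__((always_inline)) inline

template <typename T>
ALWAYS_INLINE bool try_write(T* slot, const T& item)
{
    if (UNLIKELY(slot == nullptr))          // the failure branch is marked unlikely
        return false;
    *slot = item;                           // the common path stays straight-line
    return true;
}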

Future Work

Planned improvements:

  • NUMA-aware allocation
  • Hugepage-backed rings
  • ARM NEON optimizations
  • Adaptive batching
  • Non-temporal stores
  • Prefetch tuning
  • perf c2c analysis
  • Store-buffer pressure analysis
  • Hybrid spin/backoff strategies
  • Better MPMC scalability tuning

Running Benchmarks

Build and run the GCC benchmark suite:

make run

Build and run the Clang benchmark suite:

make run-clang

Override CPU placement:

make run PRODUCER_CORE=2 CONSUMER_CORE=4
make run-clang PRODUCER_CORE=2 CONSUMER_CORE=4

Using physically separate cores is strongly recommended.

Example:

2 -> 4

Avoid SMT sibling pairs such as:

0 -> 1

because they share:

  • L1 cache
  • execution resources
  • store buffers

which produces artificially optimistic synchronization results.


ThreadSanitizer Validation

Build and run ThreadSanitizer validation:

make check-sanity

ThreadSanitizer is used for:

  • synchronization validation
  • memory ordering verification
  • race detection
  • lock-free correctness checking

ASLR is disabled during sanitizer execution for compatibility with modern Linux kernels.


Assembly & perf Analysis

The project includes dedicated assembly inspection and perf analysis targets for:

  • LFQueue
  • Drogalis
  • Rigtorp

LFQueue

Generate GCC assembly:

make lf-asm

Generate Clang assembly:

make lf-asm-clang

Run benchmark:

make lf-run

Run perf counters:

make perf-lf

Run perf counters using Clang build:

make perf-lf-clang

Drogalis

Generate GCC assembly:

make dro-asm

Generate Clang assembly:

make dro-asm-clang

Run benchmark:

make dro-run

Run perf counters:

make perf-dro

Run perf counters using Clang build:

make perf-dro-clang

Rigtorp

Generate GCC assembly:

make rigtorp-asm

Generate Clang assembly:

make rigtorp-asm-clang

Run benchmark:

make rigtorp-run

Run perf counters:

make perf-rigtorp

Run perf counters using Clang build:

make perf-rigtorp-clang

What perf Measures

The perf targets collect:

Counter         Description
cycles          Total CPU cycles consumed
instructions    Total retired instructions
branches        Number of branch instructions
branch-misses   Branch prediction failures
cache-misses    Last-level cache miss events

Conclusion

This project evolved from a lock-free queue implementation into a broader investigation of:

  • low-latency synchronization
  • cache coherency behavior
  • compiler backend effects
  • benchmarking methodology
  • inter-core communication

At sufficiently optimized levels, performance became dominated by:

  • cache topology
  • coherence traffic
  • synchronization frequency
  • compiler code generation
  • benchmark quality

rather than only queue algorithm design.
