From Naive Implementations to Ultra Low-Latency Systems
A comprehensive journey through the algorithms that power high-frequency trading systems, real-time dashboards, and mission-critical applications where every microsecond counts.
Have you ever watched a trading dashboard breatheβcandles sprouting in real time, volumes pulsing like a heartbeatβand wondered what sort of sorcery keeps it all flowing when millions of trades per minute hammer the backend?
This repository accompanies a technical article series that explores the real algorithms and data structures used in high-frequency trading (HFT) systems. We start with naive, heap-based implementations and progressively optimize them into production-grade, off-heap solutions that eliminate GC pauses and achieve sub-microsecond latency.
In high-frequency trading, a single innocent-looking trade tick spawns a horde of objects. The heap swells, the dreaded garbage collector swoops in, andβbangβyour latency budget is assassinated. In the sub-millisecond world, those stalls aren't mere nuisances. They are lost profits, evaporated opportunities, and the reason your CTO hasn't slept.
Stop throwing faster CPUs and bigger heaps at your performance problems. The real demons are:
- Allocation churn - Creating and destroying millions of short-lived objects
- Synchronization hotspots - Lock contention that serializes your parallel work
- Fighting the hardware - Cache misses, false sharing, and unpredictable memory access patterns
Off-heap data structures live in native memory that the JVM doesn't manageβso the GC has nothing to clean. This repository shows you how.
This series covers 11 battle-tested scenarios, each implementing a real-world problem with both naive (on-heap) and optimized (off-heap) approaches:
| # | Scenario | Problem Domain | Key Optimization | Speedup |
|---|---|---|---|---|
| 01 | Circular Ring Buffer | OHLCV Candlestick Aggregation | Off-heap memory, false sharing elimination | 2x + 62% less memory |
| 02 | SPSC Queue | Trade tick ingestion pipeline | Lock-free single-writer pattern | 3x throughput |
| 03 | MPSC Queue | Multi-venue order routing | CAS-based coordination | 4x under contention |
| 04 | Work Distribution | Task queue for order processing | Sharded queues per core | 5x, no false sharing |
| 05 | Event Pipeline | Multi-stage trade enrichment | Disruptor-style ring buffer | 8x, predictable latency |
| 06 | Telemetry Logging | High-frequency metrics capture | Lossy, wait-free logging | Negligible overhead |
| 07 | Sharded Processing | Partitioned order book | Per-core ownership | Linear scaling |
| 08 | Zero-Allocation Messaging | Message passing without GC | Flyweight object pooling | Zero GC pressure |
| 09 | K-FIFO Buffer | Relaxed-order event processing | Bounded reordering | 2x throughput |
| 10 | Batch Processing | Grouped trade settlement | Amortized synchronization | 10x fewer context switches |
| 11 | Trading Application | Real-time trading dashboard | Composing all patterns | Production-ready system |
Each scenario follows a consistent structure:
scenario-XX-name/
βββ approach-01-naive/ # On-heap, synchronized baseline
β βββ src/main/java/ # Implementation
β βββ src/test/java/ # Comprehensive tests
βββ approach-02-optimized/ # Off-heap, lock-free solution
β βββ src/main/java/ # Optimized implementation
β βββ src/test/java/ # Same tests + performance tests
βββ benchmarks/ # JMH microbenchmarks
β βββ src/jmh/java/ # Comparative benchmarks
βββ README.md # Scenario deep-dive
Every optimized implementation follows these principles:
- Off-Heap Memory Management - Using
sun.misc.Unsafeorjava.nio.ByteBufferto allocate outside the heap - Lock-Free Coordination - CAS operations and memory barriers instead of locks
- Cache-Friendly Design - Padding to eliminate false sharing, sequential access patterns
- Zero-Copy Operations - Direct memory access without intermediate buffers
- Mechanical Sympathy - Code that respects how modern CPUs actually work
- Java 21+ (uses modern JVM features)
- Gradle 8.5+ (build automation)
- Understanding of Java memory model (watch this)
# Build everything
./gradlew build
# Run all tests
./gradlew test
# Run specific scenario tests
./gradlew :scenario-01-circurlar-ring-buffer:approach-02-optimized:test
# Run benchmarks (requires JMH)
./gradlew :scenario-01-circurlar-ring-buffer:benchmarks:jmhAfter running tests, view HTML reports:
open scenario-01-circurlar-ring-buffer/approach-02-optimized/build/reports/tests/test/index.htmlIf you're new to off-heap programming, follow this order:
-
Scenario 01: Circular Ring Buffer - Foundation concepts
- Off-heap memory allocation
- Memory barriers and visibility
- False sharing and cache line padding
-
Scenario 02: SPSC Queue - Lock-free basics
- Single-writer principle
- Volatile semantics
- When lock-free actually matters
-
Scenario 03: MPSC Queue - Handling contention
- CAS operations
- ABA problem and solutions
- Backoff strategies
- Scenarios 04-07 - Advanced patterns for distributed workloads
- Scenarios 08-10 - Extreme optimization techniques
- Scenario 11 - Putting it all together in a real application
Every implementation includes:
- β Unit tests - Correctness verification
- β Concurrency tests - Thread-safety validation with high contention
- β Edge case tests - Boundary conditions, wraparounds, error cases
- β Stress tests - Extended runs to catch memory leaks and race conditions
- β JMH benchmarks - Objective performance measurements
Test coverage: 85%+ on all optimized implementations
All benchmarks run on: Apple M2 Pro, 10 cores, 32GB RAM, Java 21
Benchmark Mode Cnt Score Error Units
RingBufferBenchmark.naiveOffer avgt 25 245.3 Β± 12.1 ns/op
RingBufferBenchmark.optimizedOffer avgt 25 92.7 Β± 4.3 ns/op (2.6x faster)
Memory footprint: -62% (off-heap doesn't count against heap)
GC pause frequency: -100% (no heap allocations)
8 producers, 1 consumer, 1M messages:
- Naive (synchronized): ~180K msg/sec, ~250ms p99 latency
- Optimized (lock-free): ~720K msg/sec, ~45ms p99 latency
Under heavy contention: 4x throughput improvement
See individual scenario READMEs for detailed benchmarks.
The final scenario composes all patterns into a real-time trading application:
- Ingests trades from multiple venues (MPSC)
- Aggregates OHLCV candles (Ring Buffer)
- Processes orders through multi-stage pipeline (Event Pipeline)
- Logs telemetry without blocking (Lossy Logging)
- Handles 500K+ trades/second with p99 latency < 100ΞΌs
cd scenario-11-trading-application
../gradlew rundocs/
βββ INDEX.md # π Documentation hub (start here!)
βββ REPOSITORY-GUIDE.md # πΊοΈ Complete repository overview
βββ deep-dives/ # π― Scenario deep-dives (11 scenarios)
β βββ SCENARIO-01-APPROACH-01-NAIVE.md
β βββ SCENARIO-01-APPROACH-02-OFFHEAP.md
β βββ SCENARIO-02-SPSC-QUEUE.md
β βββ ... # (8 more scenarios)
βββ guides/ # π Practical guides
β βββ GETTING-STARTED.md # Setup and quick start
β βββ MEMORY-MANAGEMENT.md # Off-heap memory techniques
β βββ PROFILING-GUIDE.md # JFR, GC logs, analysis
β βββ JMH-VS-REGULAR-JAVA-RESEARCH.md # Profiling approach
β βββ PERFORMANCE-TUNING.md # JVM optimization
βββ assets/graphs/ # π Performance visualizations
βββ scenario-01/approach-comparison.png
βββ scenario-02/approach-comparison.png
βββ scenario-03/approach-comparison.png
Key Entry Points:
- π New to off-heap? β docs/INDEX.md
- ποΈ Want complete overview? β docs/REPOSITORY-GUIDE.md
- π Ready to profile? β docs/guides/PROFILING-GUIDE.md
- Java 21 - Modern language features, virtual threads
- sun.misc.Unsafe - Direct memory access (approach-02 implementations)
- JMH - Microbenchmarking framework
- JUnit 5 - Testing framework
- AssertJ - Fluent assertions
- Jacoco - Code coverage
- Gradle - Build automation
This repository demonstrates educational implementations using sun.misc.Unsafe. For production systems, consider these battle-tested libraries:
- OpenHFT Chronicle Libraries β - Production-hardened off-heap tools
- Chronicle Queue - Persistent messaging (0.78Β΅s p99)
- Chronicle Map - Off-heap key-value store
- Chronicle Wire - Zero-GC serialization
π Read our comprehensive OpenHFT guide for detailed comparisons and integration examples
Scenarios use real market data from World Stock Prices Daily Updating (Kaggle).
Place the CSV file in the project root or configure the path in scenario READMEs.
These implementations are educational and demonstrate core concepts. For production use:
- β Add comprehensive error handling and recovery
- β
Implement proper resource cleanup (
try-with-resources) - β Add monitoring and observability hooks
- β Validate on your target hardware and JVM version
- β
Use proven libraries instead of rolling your own:
- OpenHFT Chronicle Libraries - Battle-tested in banks and hedge funds (Our Guide)
- Agrona - Off-heap utilities by Real Logic
- JCTools - High-performance concurrent queues
- LMAX Disruptor - Ring buffer pattern
- β Test under real workload patterns (not just microbenchmarks)
sun.misc.Unsafe Warning: This API is internal and subject to change. Java 22+ offers Foreign Function & Memory API as a replacement.
| Aspect | Educational (This Repo) | Production (OpenHFT/Agrona) |
|---|---|---|
| Purpose | Learn concepts | Ship to production |
| Testing | Comprehensive | Battle-tested in HFT |
| Edge Cases | Educational scope | Years of bug fixes |
| Performance | Optimized | Microsecond-level tuning |
| Support | Community | Commercial options |
| Monitoring | Basic | Production-grade |
π Start by understanding our implementations, then graduate to production libraries for real systems.
This is an educational repository. If you find issues or have improvements:
- Ensure tests pass:
./gradlew test - Add tests for new features
- Follow the existing code style
- Update relevant documentation
- Actor Model Series - Understanding concurrency patterns
- Trash Talk - JVM Memory Management - How the JVM really works
- OpenHFT Chronicle Libraries Guide β - Our comprehensive guide to production-grade off-heap tools
- Chronicle Queue - Persistent messaging with 0.78Β΅s p99 latency
- Chronicle Map - Off-heap key-value store with replication
- Chronicle Wire - Zero-GC serialization framework
- Java Thread Affinity - CPU pinning for predictable performance
- Complete integration examples and performance comparisons
- Mechanical Sympathy Blog - Martin Thompson's insights
- LMAX Disruptor - The pattern that started it all
- Java Concurrency in Practice - The definitive reference
- OpenHFT GitHub Organization - Production-grade off-heap libraries
MIT License - See LICENSE file for details
This project accompanies a technical article series exploring high-performance Java systems. Each scenario represents a real problem from high-frequency trading, with detailed explanations of:
- Why the naive approach fails
- What hardware-level bottlenecks exist
- How the optimized solution addresses them
- Performance characteristics and trade-offs
Goal: Bridge the gap between academic lock-free algorithms and real-world production systems.
| When You Need... | Use Scenario... | Key Pattern |
|---|---|---|
| Fixed-size cyclic buffer | 01 - Ring Buffer | Off-heap, sequential access |
| Single writer, single reader | 02 - SPSC | Lock-free, happens-before |
| Multiple writers, single reader | 03 - MPSC | CAS coordination |
| Multiple writers & readers | 04 - MPMC | Advanced CAS, backoff |
| Multi-stage processing | 05 - Event Pipeline | Disruptor pattern |
| Non-blocking logging | 06 - Telemetry | Lossy, wait-free |
| Per-core data | 07 - Sharding | Cache line alignment |
| Zero GC overhead | 08 - Zero-Alloc | Object pooling |
| Relaxed ordering OK | 09 - K-FIFO | Bounded reordering |
| Bulk operations | 10 - Batching | Amortization |
Start with Scenario 01: Circular Ring Buffer β
Build the foundation, then progress through increasingly complex scenarios. Each one builds on previous concepts while introducing new optimization techniques.
Good luck, and may your latencies be ever in your favor! π