A price-time-priority limit order book / matching engine in C++20 — header-only, allocation-free on the hot path, and checked against a naive reference by differential fuzzing.
▶ Live demo: https://orderbook.pages.dev — the same engine compiled to WebAssembly,
matching a simulated market in your browser (hover the depth, click a level to trade into
it). Source in web/.
Single-threaded, one core, warm cache (Apple M-series, -O3 -march=native, 1M orders
over a 65,536-tick book). make bench:
| op | latency | throughput |
|---|---|---|
rest |
47 ns | 21 M ops/s |
cancel |
75 ns | 13 M ops/s |
match |
55 ns/fill | 18 M fills/s |
mixed |
76 ns/op | 13 M ops/s |
mixed is a realistic interleave of adds, cancels, and marketable orders — the number
that reflects normal use.
Tail latency matters more than the mean in trading. Per-op cancel, netting out the
clock call (steady_clock, unpinned thread): p50 ~85 ns, p99 ~330 ns. The p50→p99 spread
is cache-miss variance on random-order cancellation; the p99.9+ tail (µs) is OS scheduling,
not the algorithm — the hot path allocates nothing, so there are no malloc/rehash spikes.
Characterising the true far tail needs core isolation + a cycle counter; the benchmark is
honest that it doesn't do that.
Limit, market, IOC (immediate-or-cancel), FOK (fill-or-kill), modify
(cancel/replace), and cancel — all price-time priority. modify keeps time priority on a
same-price size decrease and re-queues (loses priority, may cross) on any price change or
size increase, the standard exchange rule. Every one is checked against the reference in the
differential fuzz.
After a sized construction, add_limit / cancel / add_market perform zero heap
allocations. Order nodes come from a reused free list; the id→slot index is a custom
open-addressing flat map (flat_id_map.hpp) with backward-shift deletion — not a
node-based std::unordered_map that would malloc on every insert and free on every
erase. A test instruments global new/delete and asserts the counts are exactly zero
across 100k rests + 100k matches + 100k cancels (tests/test_alloc_audit.cpp).
A matching engine is only as good as its correctness proof. The core is validated by a
differential fuzz test: the same stream of random operations is applied to the fast
engine and to a deliberately naive ReferenceBook (std::map + std::deque), and after
every operation their fills and full book state are asserted identical — across limit,
market, IOC, FOK, modify, and cancel. 180,000 ops across three seeds, plus targeted unit
scenarios per order type. The two supporting data structures are fuzzed the same way against
obvious references:
test_orderbook— engine vs naive reference book (differential fuzz + units)test_occupancy— the best-pointer bitmap vs a brute-forcestd::settest_flat_id_map— the flat hash map vsstd::unordered_map+ the allocation audittest_alloc_audit— the whole engine's hot path is malloc-free
make test # all four suites, incl. the differential fuzz -> SUCCESSCI runs every suite under ASan + UBSan and the benchmark on each push. The fuzz passes clean under UBSan over the full 180k-op sweep.
An array-indexed price ladder. Prices are integer ticks in [0, num_ticks). Each
side holds a vector<Level> indexed by tick; each Level is an intrusive FIFO of
order nodes drawn from a reused pool. Two structures keep every operation fast:
- A two-level occupancy bitmap per side (
occupancy.hpp) marks which levels are non-empty, so moving the cached best-bid/ask to the next populated level is a bit-scan — no linear tick walk when a far level empties. - A flat open-addressing id map (
flat_id_map.hpp) gives O(1), allocation-free id→slot lookup for cancels and maker removal.
| operation | cost |
|---|---|
add_limit |
O(1) to rest; O(fills) when it crosses |
cancel |
O(1) unlink + O(1) id-map erase + a bit-scan best refresh |
match |
O(1) per fill + a bit-scan best refresh per emptied level |
The bit-scan touches at most ~ticks/4096 summary words — effectively constant for any
realistic tick count, and with no linear-scan worst case (the flaw a plain ladder has
when the best level empties far from the next one).
#include "orderbook/order_book.hpp"
using namespace ob;
OrderBook book(/*num_ticks=*/1 << 16, /*reserve_orders=*/1'000'000);
std::vector<Trade> fills;
book.add_limit(1, Side::Sell, /*price=*/101, /*qty=*/10, fills); // rests
book.add_limit(2, Side::Buy, /*price=*/101, /*qty=*/6, fills); // crosses -> 1 fill of 6
// fills == { Trade{taker=2, maker=1, price=101, qty=6} }
book.best_ask(); // 101 (4 left resting; valid only when has_ask())
book.modify(1, 101, 3, fills); // shrink at same price -> keeps time priority
book.add_ioc(4, Side::Buy, 101, 100, fills); // fills what crosses, discards the rest
book.add_fok(5, Side::Buy, 101, 100, fills); // all-or-nothing (0 if filled, else qty)
book.cancel(1); // true
book.add_market(6, Side::Buy, 100, fills); // sweeps the opposite side; never restsHeader-only: #include "orderbook/order_book.hpp" and add include/ to your path.
make test && make bench # or:
cmake -B build && cmake --build build && ctest --test-dir buildRequires a C++20 compiler. No dependencies.
include/orderbook/ order_book.hpp the engine (array ladder + intrusive FIFO pool)
occupancy.hpp two-level bitmap for O(1) best-level finding
flat_id_map.hpp open-addressing id->slot map (allocation-free)
types.hpp OrderId / Price / Qty / Side / Trade
tests/ test_orderbook.cpp differential fuzz + unit scenarios
test_occupancy.cpp bitmap vs brute force
test_flat_id_map.cpp map vs unordered_map + allocation audit
test_alloc_audit.cpp engine hot path is malloc-free
reference_book.hpp naive oracle (std::map/deque)
framework.hpp tiny zero-dep test harness
bench/ bench.cpp rest / cancel / match / mixed microbenchmarks
Deliberately focused: single-threaded, single-symbol, integer tick prices, limit +
market orders with price-time priority. Accessors expose sentinels on an empty book
(best_bid() == -1, best_ask() == num_ticks()), valid only when has_bid()/has_ask().
Qty is 32-bit and level totals are not saturated — fine within the intended range.
Natural extensions (not included): stop/pegged orders, per-symbol sharding, a lock-free
SPSC ingress, and L2/L3 snapshot streaming. The engine is the core; those are layers on
top of it.
MIT.