To achieve low latency with minimal variance while sustaining high throughput, it is important to define a set of design principles that guide development. When a trade-off is required, the design principles help in choosing between competing alternatives.
Many design principles come to bear on any implementation. Not all design principles require trade-offs, but many do. The following set of design principles is key to the design of Aeron and is likely to drive most trade-offs. When two or more design principles conflict with each other, the one with the highest rank wins.
- Garbage free in steady state running
- Apply Smart Batching in the message path
- Lock-free algorithms in the message path
- Non-blocking IO in the message path
- No exceptional cases in the message path
- Apply the Single Writer Principle
- Prefer unshared state
- Avoid unnecessary data copies
When talking about the "message path" we are referring to the path taken through Aeron by data messages sent from a publisher to a subscriber, and not any internal messages Aeron may use for its operational purposes.
Garbage free in steady state running
Garbage collection pauses are one of the biggest causes of latency in a virtual machine. The best cases for these pauses are just under a millisecond, which is significantly longer than the round trip time between two computers over a local area network. Aeron should not contribute to GC pauses, which makes it a candidate for inclusion in an application that carefully manages its own allocation.
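As a minimal sketch of what "garbage free in steady state" can look like, the following Java class allocates its buffer once at construction and reuses it for every message. The class name `ReusableEncoder` and the two-field message layout are hypothetical and chosen only for illustration; this is not Aeron's actual encoding.

```java
import java.nio.ByteBuffer;

// Hypothetical encoder that allocates once and reuses the same buffer.
final class ReusableEncoder {
    // Allocated once at startup; never replaced afterwards.
    private final ByteBuffer scratch = ByteBuffer.allocateDirect(64);

    // Encodes two long fields into the preallocated buffer.
    // No objects are created on this path, so it adds no GC pressure.
    ByteBuffer encode(final long sequence, final long value) {
        scratch.clear();
        scratch.putLong(sequence);
        scratch.putLong(value);
        scratch.flip();
        return scratch; // same instance every call; caller must consume before the next encode
    }
}
```

The trade-off is that the returned buffer is only valid until the next call to `encode`, so the caller has to finish with it before encoding another message.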
Apply Smart Batching in the message path
Communications traffic tends to be very bursty in nature. To mitigate the queueing latency introduced by bursts of traffic, Smart Batching should be employed to amortise the expensive costs within an algorithm. Smart Batching has the additional benefit of improving throughput by filling network packets to their optimal size.
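The idea can be sketched as a consumer that takes whatever is already queued, up to a limit, and pays the expensive cost once for the whole group. The class `SmartBatcher`, its counters, and the use of `String` messages are assumptions made for illustration; the "expensive operation" is only counted here, not performed.

```java
import java.util.Queue;

// Hypothetical batcher: one expensive operation per drained group of messages.
final class SmartBatcher {
    private final int maxBatch;
    int sends;        // number of expensive operations performed
    int messagesSent; // total messages covered by those operations

    SmartBatcher(final int maxBatch) {
        this.maxBatch = maxBatch;
    }

    // Drains what is pending right now (never waits for more) and
    // accounts for a single expensive operation covering the batch.
    int drain(final Queue<String> pending) {
        int count = 0;
        while (count < maxBatch && pending.poll() != null) {
            count++;
        }
        if (count > 0) {
            sends++; // stands in for one system call or packet write
            messagesSent += count;
        }
        return count;
    }
}
```

Under light load each batch holds a single message, so no latency is added by waiting; under a burst the batch grows and the per-message cost falls.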
Lock-free algorithms in the message path
A thread can be subject to an interrupt at any stage in its execution. If an algorithm is not lock-free, other threads can be blocked until the interrupted thread can resume. To avoid this blocking, all concurrent algorithms must be able to complete in a finite number of steps without blocking other threads. A lock-free algorithm by definition cannot employ locks. Avoiding locks also avoids the jitter introduced by lock inflation and re-biasing, which can be significant, i.e. running into tens of milliseconds, because the JVM must reach a safepoint and this is then followed by OS scheduling effects.
Note: Algorithms can only be lock-free up until the point of making system calls to the media layer. For example, the standard Java NIO libraries use locks, as does the Linux sockets library. However, it is possible to integrate a user space network stack such as Open Onload which is lock-free.
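For illustration, a lock-free update is commonly written as a compare-and-set retry loop. The sketch below tracks a running maximum with `java.util.concurrent.atomic.AtomicLong`; the class name `LockFreeMax` is hypothetical and the example is not taken from Aeron's code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock-free maximum tracker using a CAS retry loop.
final class LockFreeMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Safe to call from any thread. A failed compareAndSet means another
    // thread made progress, so the system as a whole never stalls.
    void offer(final long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return; // nothing to update
            }
        } while (!max.compareAndSet(current, candidate));
    }

    long get() {
        return max.get();
    }
}
```

If a thread is interrupted between reading `current` and the compare-and-set, other threads are not held up; the interrupted thread simply retries when it resumes.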
Non-blocking IO in the message path
The cost of the scheduler waking up a thread blocked on IO can be greater than the round trip time (RTT) on a fast local network. Also, a blocked thread cannot do other work while blocked. To avoid this cost and restriction, non-blocking IO should be employed.
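A minimal sketch of non-blocking receive using the standard `java.nio.channels.DatagramChannel` follows. The class `NonBlockingPoll` and its helper methods are hypothetical, and binding to an ephemeral loopback port is an assumption made only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Hypothetical helper showing a poll that never parks the calling thread.
final class NonBlockingPoll {
    // Opens a UDP channel in non-blocking mode on an ephemeral loopback port.
    static DatagramChannel openNonBlocking() {
        try {
            final DatagramChannel channel = DatagramChannel.open();
            channel.configureBlocking(false);
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            return channel;
        } catch (final IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Returns the number of bytes received, or -1 if no datagram was ready.
    // In non-blocking mode receive() returns null immediately when idle.
    static int pollOnce(final DatagramChannel channel, final ByteBuffer buffer) {
        try {
            buffer.clear();
            final SocketAddress source = channel.receive(buffer);
            return source == null ? -1 : buffer.position();
        } catch (final IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

When `pollOnce` returns -1 the calling thread remains free to do other work, such as servicing another channel, before polling again.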
No exceptional cases in the message path
To keep code paths simple and predictable to the processor, the main paths of execution in the program should not be burdened by the exceptional cases. For example, the code to fragment and reassemble larger messages should not burden the path of the common small messages. Exceptional cases should take the exceptional cost.
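One way to picture this is a sender whose common path is a single short branch, with fragmentation pushed into a separate method. The class `FragmentingSender`, the 1024-byte payload limit, and the counters are hypothetical and exist only to make the two paths visible.

```java
// Hypothetical sender that keeps the small-message path free of fragmentation logic.
final class FragmentingSender {
    static final int MAX_PAYLOAD = 1024; // assumed per-frame payload limit

    int fastPathSends;
    int slowPathSends;
    int framesWritten;

    // Common case: one comparison, one frame.
    void offer(final int length) {
        if (length <= MAX_PAYLOAD) {
            fastPathSends++;
            framesWritten++;
        } else {
            offerFragmented(length); // the exceptional case pays its own cost
        }
    }

    // Rare case kept out of line so it does not bloat the hot path.
    private void offerFragmented(final int length) {
        slowPathSends++;
        framesWritten += (length + MAX_PAYLOAD - 1) / MAX_PAYLOAD; // ceiling division
    }
}
```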
Apply the Single Writer Principle
Contended access to mutable state requires mutual exclusion or conditional update protection. Either of these protection mechanisms causes queues to form as contended updates are applied. To avoid this contention and the associated queueing effects, all state should be owned by a single writer for mutation purposes, thus following the Single Writer Principle.
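As a rough sketch, a counter with exactly one writing thread does not need a compare-and-set loop or a lock; an ordered store is enough for readers on other threads to observe it. The class `SingleWriterCounter` is hypothetical, and the single-writer rule is a convention the caller must uphold, not something the code enforces.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter that assumes exactly one writer thread.
final class SingleWriterCounter {
    private final AtomicLong value = new AtomicLong();

    // Owner thread only. With no competing writers, a plain read followed by
    // an ordered store (lazySet) is sufficient; no CAS retry is needed.
    void increment() {
        value.lazySet(value.get() + 1);
    }

    // Any thread may read.
    long get() {
        return value.get();
    }
}
```

If a second thread were to call `increment`, updates could be lost, which is why ownership has to be decided by design.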
Prefer unshared state
Even when state is only updated by a single writer, the updates need to happen atomically with the correct memory ordering. This can involve data copies, memory fences, and algorithm complexity. It is better to keep data private and local to individual threads. Local data can be laid out optimally for its access patterns and is simpler because it is not accessed concurrently. Local unshared state can be updated by sending messages between threads to indicate state changes or events.
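The last point can be sketched as an owner thread that keeps its state in a plain field and learns about changes only through an inbox. The class `PrivateTotal` is hypothetical; a `ConcurrentLinkedQueue<Long>` is used for brevity even though boxing allocates, so a garbage-free implementation would substitute a preallocated ring buffer.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical component whose state is private to one owning thread.
final class PrivateTotal {
    // The only shared structure: a queue of change events.
    private final Queue<Long> inbox = new ConcurrentLinkedQueue<>();

    // Plain field, touched by the owner thread only, so no fences are needed.
    private long total;

    // Any thread: describe the change instead of mutating the state.
    void send(final long delta) {
        inbox.offer(delta);
    }

    // Owner thread only: apply all pending events to the private state.
    int drain() {
        int applied = 0;
        Long delta;
        while ((delta = inbox.poll()) != null) {
            total += delta;
            applied++;
        }
        return applied;
    }

    // Owner thread only.
    long total() {
        return total;
    }
}
```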
Avoid unnecessary data copies
Creating copies of data is relatively cheap on modern hardware; however, it is not free. Copying data takes time and involves the required cache lines being pulled through the cache sub-system, thus evicting other data. Data should only be copied when necessary to the algorithm or if significant complexity can be avoided. Data should not be copied without consideration of its life-cycle; that is, one should not be inconsiderate or lazy.
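As an illustration of avoiding a copy, a region of a `java.nio.ByteBuffer` can be exposed as a view over the same memory instead of being copied into a new array. The helper `PayloadView` is hypothetical. Note that `duplicate()` and `slice()` still allocate small wrapper objects, so a garbage-free message path would reuse a view object; the point here is only that the payload bytes are not copied.

```java
import java.nio.ByteBuffer;

// Hypothetical helper that exposes part of a frame without copying its bytes.
final class PayloadView {
    // Returns a buffer covering [offset, offset + length) of the frame.
    // The view shares memory with the frame, so the view's life-cycle is
    // tied to the frame's: it is only valid while the frame is unchanged.
    static ByteBuffer view(final ByteBuffer frame, final int offset, final int length) {
        final ByteBuffer duplicate = frame.duplicate(); // independent position and limit
        duplicate.limit(offset + length);
        duplicate.position(offset);
        return duplicate.slice();
    }
}
```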