Monitoring and Debugging

Dmytro Vyazelenko edited this page May 2, 2024 · 77 revisions

Debugging and monitoring distributed systems can be a challenge. Aeron has been designed in an open fashion so that much of its internal state can be observed during operation, which we hope simplifies this challenge. If you have suggestions for improvements then we would love to hear them. We try to accommodate suggestions that will benefit the Aeron user community provided they do not violate the Design Principles. Suggestions can be posted as Issues. Some of the tools described below can be found in the samples module. These tools tend to be quite simple and can be easily customised.

Scripts are also available in the samples module to make using these tools a little easier.

  1. Errors
  2. System and Position Counters
  3. Log Inspection
  4. Debug Logging
  5. Loss Reporting
  6. Backlog Reporting

Errors

Rather than take the approach of using log files, which can fill disks due to chronic issues, Aeron records errors to a section of its CnC (Command and Control) file as distinct errors with a count of observations plus the time of first and last observation. This means that when the same error is experienced many times only the count and the latest observation timestamp are updated. In the unlikely event that this distinct error log in the CnC file fills up, further errors are sent to STDERR. The amount of space required for distinct errors is typically not very large and can be configured with the following system property to the Media Driver. Remember only distinct errors are recorded in full.

aeron.error.buffer.length=<default is 1MB>

If errors are present in the distinct error log when the driver starts then they will be copied to a file of the name <timestamp>-error.log in text format. The log is cleared down on each invocation of the Media Driver.
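The distinct-error idea described above (one record per distinct error, holding a count plus first and last observation times) can be pictured with a minimal in-memory sketch. This is not Aeron's implementation, which records into a section of the CnC file; the class name `DistinctErrorSketch`, its methods, and the use of the error text as the key are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: an in-memory stand-in for a distinct error log. */
final class DistinctErrorSketch
{
    /** One record per distinct error (hypothetical layout). */
    static final class Observation
    {
        int count;
        long firstTimestampMs;
        long lastTimestampMs;
    }

    private final Map<String, Observation> observations = new LinkedHashMap<>();

    /** Record an error; a repeat only bumps the count and the last timestamp. */
    void record(final String error, final long nowMs)
    {
        Observation observation = observations.get(error);
        if (observation == null)
        {
            observation = new Observation();
            observation.firstTimestampMs = nowMs;
            observations.put(error, observation);
        }

        observation.count++;
        observation.lastTimestampMs = nowMs;
    }

    Observation get(final String error)
    {
        return observations.get(error);
    }

    int distinctCount()
    {
        return observations.size();
    }

    public static void main(final String[] args)
    {
        final String error = "java.net.SocketException: Network is unreachable";
        final DistinctErrorSketch log = new DistinctErrorSketch();
        log.record(error, 1_000L);
        log.record(error, 5_000L);

        final Observation o = log.get(error);
        System.out.println(o.count + " observations, first=" + o.firstTimestampMs + ", last=" + o.lastTimestampMs);
    }
}
```

Because repeats never add a new record, the space used grows with the number of distinct errors rather than with the number of occurrences.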

The command line tool ErrorStat can be used to read the error log at any time during Media Driver operation, or even after the Media Driver has been shut down, provided the CnC file still exists.

$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  [-Daeron.dir=<path to aeron directory>] \
  io.aeron.samples.ErrorStat

Counters

Aeron tracks a lot of its internal state as AtomicCounters in a memory mapped file. This file can be read by any external process with no significant impact on performance. These counters are divided into two groups.

  1. System Counters: Counters of significant events observed in the system such as errors, counts, rates, and hints that further investigation should be taken.
  2. Stream Counters: Counters for tracking and limiting progress on byte streams of messages.
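To illustrate how a value kept in a memory mapped file can be read from outside the process that writes it, here is a minimal sketch using only the JDK. The file layout (a single 8-byte value at offset 0) and the class and method names are assumptions made for the sketch; it does not reflect the actual CnC file format, for which AeronStat should be used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: one counter stored at offset 0 of a memory mapped file. */
final class MappedCounterSketch
{
    /** Writer side: map the file read-write and store the counter value. */
    static void write(final Path file, final long value)
    {
        try (FileChannel channel = FileChannel.open(
            file, StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE))
        {
            final MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            buffer.putLong(0, value);
            buffer.force();
        }
        catch (final IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
    }

    /** Observer side: map the same file read-only and load the counter value. */
    static long read(final Path file)
    {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ))
        {
            final MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, Long.BYTES);
            return buffer.getLong(0);
        }
        catch (final IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
    }

    /** Create a temporary file to stand in for the counters file. */
    static Path tempFile()
    {
        try
        {
            final Path file = Files.createTempFile("counter-sketch", ".dat");
            file.toFile().deleteOnExit();
            return file;
        }
        catch (final IOException ex)
        {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(final String[] args)
    {
        final Path file = tempFile();
        write(file, 42L);
        System.out.println("counter value = " + read(file));
    }
}
```

The observer only maps the file read-only, so it never interferes with the writer's updates.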

Counters can be inspected for any Media Driver with the AeronStat command line tool:

$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  [-Daeron.dir=<path to aeron directory>] \
  io.aeron.samples.AeronStat [filter options]

AeronStat will run continuously and output the counters once per second. The default is to output all counters if no filter criteria are provided.

# All system counters
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.AeronStat type=0

# For just the count of errors
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.AeronStat type=0 identity=15

The filter criteria options on stream position counters are:

  1. type: This is the type id of the counter; 0 is system counters; 1-4 are some of the stream counters.
  2. identity: The key which identifies the counter within its type scope.
  3. session: Session id to be used with position counters.
  4. stream: Stream id to be used with position counters.
  5. channel: Channel to be used with position counters.

The filter criteria are regular expressions which can be useful for filtering out a range of streams.

# All position counters
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.AeronStat 'type=[1-4]'

# The counters for a specific stream
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.AeronStat 'type=[1-4]' \
  session=123456 stream=10 'channel=aeron:udp\?endpoint=localhost:40123'

Note: Remember to escape special characters in the filter criteria, e.g. the "?" character in the channel filter.
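To see why the "?" needs escaping, the following sketch applies filter strings as regular expressions with `java.util.regex`. It is an assumption that partial matching (`find()`) is a fair stand-in for how the tool applies its filters; the class and method names are hypothetical. The point about "?" holds for full matching as well.

```java
import java.util.regex.Pattern;

/** Illustrative only: treats a filter string as a regular expression. */
final class FilterRegexSketch
{
    /** True if the filter, compiled as a regular expression, is found in the value. */
    static boolean accepts(final String filter, final String value)
    {
        return Pattern.compile(filter).matcher(value).find();
    }

    public static void main(final String[] args)
    {
        final String channel = "aeron:udp?endpoint=localhost:40123";

        // false: an unescaped '?' makes the preceding 'p' optional instead of matching a literal '?'
        System.out.println(accepts("aeron:udp?endpoint=localhost:40123", channel));

        // true: the escaped '?' matches the literal character
        System.out.println(accepts("aeron:udp\\?endpoint=localhost:40123", channel));

        // true: Pattern.quote escapes every special character in one go
        System.out.println(accepts(Pattern.quote(channel), channel));

        // false: a character class of 1 to 4 does not match type id 0
        System.out.println(accepts("[1-4]", "0"));
    }
}
```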

For a rolled-up view of each stream with its associated counters on a single line, use the StreamStat tool:

$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  [-Daeron.dir=<path to aeron directory>] \
  io.aeron.samples.StreamStat

Log Inspection

It is possible to inspect the contents of either a publication log buffer on the producer side of a connection, or the rebuilt image on the consumer side, with the LogInspector command line tool. This tool takes two arguments. First is the filename of the log to be inspected. The second optional argument is a limit in number of bytes to be dumped for the body of each message in the log. Additionally, the following system properties can be provided to customise the output:

  • aeron.log.inspector.data.format: configures the output format for the body of each message which defaults to hex. The other supported value is ascii which can be useful when you know the message contents are strings.
  • aeron.log.inspector.skipDefaultHeader: is a boolean flag indicating if the default header output should be skipped, defaults to false. Valid values are true and false.
  • aeron.log.inspector.scanOverZeroes: is a boolean flag indicating if the inspector should scan over runs of zeros in the file, defaults to false. Useful for scanning a log that joined late or experienced loss.

# Dump the contents of a log with up to 200 bytes of each message in hex.
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.LogInspector \
  /dev/shm/aeron-mjpt777/publications/<filename>.logbuffer 200 > dump.txt

# Dump the contents of a log with up to 50 bytes of each message in ASCII.
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  -Daeron.log.inspector.data.format=ascii \
  io.aeron.samples.LogInspector \
  /dev/shm/aeron-mjpt777/images/<filename>.logbuffer 50 > dump.txt

This also works for archive segment files:

# Dump the contents of a segment file with up to 50 bytes of each message in ASCII.
$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  -Daeron.log.inspector.data.format=ascii \
  io.aeron.samples.archive.SegmentInspector \
  <segment-filename>.rec 50 > dump.txt
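The combined effect of the byte limit argument and the data format property on a single message body can be pictured with the sketch below. It is not the inspector's code, and the exact layout of the tool's output may differ; the class name, method name, and lowercase hex rendering are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: renders at most 'limit' bytes of a message body as hex or ASCII. */
final class BodyDumpSketch
{
    static String dump(final byte[] body, final int limit, final String format)
    {
        // Never render more bytes than the body holds or the limit allows.
        final int length = Math.max(0, Math.min(body.length, limit));

        if ("ascii".equals(format))
        {
            return new String(body, 0, length, StandardCharsets.US_ASCII);
        }

        // Any other value falls back to hex, mirroring the documented default.
        final StringBuilder sb = new StringBuilder(length * 2);
        for (int i = 0; i < length; i++)
        {
            sb.append(String.format("%02x", body[i] & 0xFF));
        }

        return sb.toString();
    }

    public static void main(final String[] args)
    {
        final byte[] body = "Hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(dump(body, 200, "hex"));
        System.out.println(dump(body, 3, "ascii"));
    }
}
```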

Debug Logging

Aeron does not take the common approach of littering a code base with logging statements to aid debugging. Instead, logging statements are woven into a running system via a Java Agent, which uses byte code weaving to add them only where required. An example Java agent for adding logging can be found in the aeron-agent module.

The event logging agent can be added to any components upon start up as follows:

$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  -javaagent:aeron-agent/build/libs/aeron-agent-<version>.jar \
  -Daeron.event.cluster.log=all \
  -Daeron.event.cluster.log.disable=CANVASS_POSITION,APPEND_POSITION,COMMIT_POSITION \
  io.aeron.cluster.ConsensusModule

The agent is configured with the following system properties:

  • aeron.event.log.filename: System property for the file to which the log is appended. If not set then STDOUT will be used. Logging to file is significantly more efficient than logging to STDOUT.

  • aeron.event.buffer.length: System property for length of the in-memory buffer used between the capture pointcut and the log reader. Defaults to 8MB.

  • aeron.event.log.reader.classname: System property for the log reader class which consumes the event buffer. Defaults to io.aeron.agent.EventLogReaderAgent.

  • aeron.event.log: System property containing a comma separated list of Driver event codes. There are also two special values: all for all possible events and admin for administration events.

  • aeron.event.log.disable: System property containing a comma separated list of Driver events to exclude.

  • aeron.event.archive.log: System property containing a comma separated list of Archive event codes. all can be used to enable all Archive events.

  • aeron.event.archive.log.disable: System property containing a comma separated list of Archive events to exclude.

  • aeron.event.cluster.log: System property containing a comma separated list of Cluster event codes. all can be used to enable all Cluster events.

  • aeron.event.cluster.log.disable: System property containing a comma separated list of Cluster events to exclude.

    Note: When using all be careful as these logs can be very verbose e.g. all frames on the Media Driver or all position events in Cluster. New events may also appear with newer releases, which could increase the amount of data in the debug log.
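How an enable list combines with a disable list can be sketched as simple set arithmetic: expand the enabled codes (with all standing for every code), then remove the disabled ones. The enum below is a hypothetical subset built from the three Cluster codes used in the earlier command plus a placeholder; the class and method names are assumptions, and the agent's real parsing may differ.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only: resolves enable and disable lists into a set of events. */
final class EventSelectionSketch
{
    /** Hypothetical subset of event codes; SOME_OTHER_EVENT is a placeholder. */
    enum Event
    {
        CANVASS_POSITION, APPEND_POSITION, COMMIT_POSITION, SOME_OTHER_EVENT
    }

    /** Parse a comma separated list of codes, where "all" expands to every code. */
    static Set<Event> parse(final String list)
    {
        final EnumSet<Event> events = EnumSet.noneOf(Event.class);
        if (list == null)
        {
            return events;
        }

        for (final String token : list.split(","))
        {
            final String name = token.trim();
            if (name.isEmpty())
            {
                continue;
            }

            if ("all".equals(name))
            {
                events.addAll(EnumSet.allOf(Event.class));
            }
            else
            {
                // Throws IllegalArgumentException for a code this sketch does not know.
                events.add(Event.valueOf(name));
            }
        }

        return events;
    }

    /** Enabled events minus the disabled ones. */
    static Set<Event> enabled(final String enable, final String disable)
    {
        final Set<Event> result = parse(enable);
        result.removeAll(parse(disable));
        return result;
    }

    public static void main(final String[] args)
    {
        System.out.println(enabled("all", "CANVASS_POSITION,APPEND_POSITION,COMMIT_POSITION"));
    }
}
```

Starting from all and disabling the noisiest codes, as in the command shown earlier, keeps the configuration short while still trimming the most verbose output.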

Loss Reporting

When loss is detected at the receiver side it is logged to the Loss Report (loss-report.dat) in the Aeron driver directory. The log contains an aggregate entry reporting the loss by stream for the number of times loss was observed, total bytes lost, time of first observation, time of last observation, and the details for the stream. This report can be read with the LossStat tool or by parsing the file as the format is published. The LossStat tool will output to STDOUT in a CSV format for storage and later analysis.

$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.LossStat

Backlog Reporting

As data flows from publisher to sender, then across the network to receiver, before being consumed by the subscribers, it can be buffered between each stage. The BacklogStat tool inspects the counters and builds a report by stream showing the backlog in bytes buffered between each stage.

$ java -cp aeron-all/build/libs/aeron-all-<version>.jar \
  io.aeron.samples.BacklogStat
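The arithmetic behind such a report can be sketched as the difference between the positions of adjacent stages. The position names and values below are hypothetical and only illustrate the idea; they are not taken from the tool. The result is clamped at zero on the assumption that positions sampled at slightly different moments could otherwise yield a negative difference.

```java
/** Illustrative only: backlog in bytes between two adjacent stages of a stream. */
final class BacklogSketch
{
    /** Bytes the upstream stage has progressed beyond the downstream stage, never negative. */
    static long backlog(final long upstreamPosition, final long downstreamPosition)
    {
        return Math.max(0L, upstreamPosition - downstreamPosition);
    }

    public static void main(final String[] args)
    {
        // Hypothetical sampled positions for one stream, in bytes.
        final long publisherPosition = 10_240L;
        final long senderPosition = 8_192L;
        final long receiverPosition = 8_192L;
        final long subscriberPosition = 4_096L;

        System.out.println("publisher -> sender:     " + backlog(publisherPosition, senderPosition) + " bytes");
        System.out.println("sender -> receiver:      " + backlog(senderPosition, receiverPosition) + " bytes");
        System.out.println("receiver -> subscriber:  " + backlog(receiverPosition, subscriberPosition) + " bytes");
    }
}
```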