- Realtime Logging
- Working with flogfiles
- Local twistd.log files
- Adding log messages
- Log Messages During Unit Tests
Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to record information about what is happening inside the Tahoe node. This is primarily for use by programmers and grid operators who want to find out what went wrong.
The Foolscap logging system is documented at http://foolscap.lothar.com/docs/logging.html.
The Foolscap distribution includes a utility named "
flogtool" that is
used to get access to many Foolscap logging features. This command only
works when foolscap and its dependencies are installed correctly.
Tahoe-LAFS v1.10.0 and later include a
tahoe debug flogtool command
that can be used even when foolscap is not installed; to use this, prefix
all of the example commands below with
For earlier versions since Tahoe-LAFS v1.8.2, installing Foolscap v0.6.1
or later and then running
bin/tahoe @flogtool from the root of a
Tahoe-LAFS source distribution may work (but only on Unix, not Windows).
When you are working on Tahoe code, and want to see what the node is doing,
the easiest tool to use is "
flogtool tail". This connects to the Tahoe
node and subscribes to hear about all log events. These events are then
displayed to stdout, and optionally saved to a file.
flogtool tail" connects to the "logport", for which the FURL is stored
BASEDIR/private/logport.furl . The following command will connect to
this port and start emitting log information:
flogtool tail BASEDIR/private/logport.furl
--save-to FILENAME option will save all received events to a file,
where then can be examined later with "
flogtool dump" or "
--catch-up option will ask the node to dump all stored
events before subscribing to new ones (without
--catch-up, you will only
hear about events that occur after the tool has connected and subscribed).
Foolscap keeps a short list of recent events in memory. When something goes
wrong, it writes all the history it has (and everything that gets logged in
the next few seconds) into a file called an "incident". These files go into
BASEDIR/logs/incidents/ , in a file named
incident-TIMESTAMP-UNIQUE.flog.bz2". The default definition of
"something goes wrong" is the generation of a log event at the
level or higher, but other criteria could be implemented.
The typical "incident report" we've seen in a large Tahoe grid is about 40kB compressed, representing about 1800 recent events.
These "flogfiles" have a similar format to the files saved by "
tail --save-to". They are simply lists of log events, with a small header
to indicate which event triggered the incident.
flogtool dump FLOGFILE" command will take one of these
files and print their contents to stdout, one line per event. The raw event
dictionaries can be dumped by using "
flogtool dump --verbose FLOGFILE".
flogtool web-viewer" command can be used to examine the flogfile in
a web browser. It runs a small HTTP server and emits the URL on stdout. This
view provides more structure than the output of "
flogtool dump": the
parent/child relationships of log events is displayed in a nested format.
flogtool web-viewer" is still fairly immature.
Working with flogfiles
flogtool filter" command can be used to take a large flogfile
(perhaps one created by the log-gatherer, see below) and copy a subset of
events into a second file. This smaller flogfile may be easier to work with
than the original. The arguments to "
flogtool filter" specify filtering
criteria: a predicate that each event must match to be copied into the target
--after are used to exclude events outside a given
window of time.
--above will retain events above a certain severity
--from retains events send by a specific tubid.
--strip-facility removes events that were emitted with a given facility
In a deployed Tahoe grid, it is useful to get log information automatically transferred to a central log-gatherer host. This offloads the (admittedly modest) storage requirements to a different host and provides access to logfiles from multiple nodes (web-API, storage, or helper) in a single place.
There are two kinds of gatherers: "log gatherer" and "stats gatherer". Each
produces a FURL which needs to be placed in the
NODEDIR/tahoe.cfg file of
each node that is to publish to the gatherer, under the keys
"log_gatherer.furl" and "stats_gatherer.furl" respectively. When the Tahoe
node starts, it will connect to the configured gatherers and offer its
logport: the gatherer will then use the logport to subscribe to hear about
The gatherer will write to files in its working directory, which can then be
examined with tools like "
flogtool dump" as described above.
The "incident gatherer" only collects Incidents: records of the log events that occurred just before and slightly after some high-level "trigger event" was recorded. Each incident is classified into a "category": a short string that summarizes what sort of problem took place. These classification functions are written after examining a new/unknown incident. The idea is to recognize when the same problem is happening multiple times.
A collection of classification functions that are useful for Tahoe nodes are
misc/incident-gatherer/support_classifiers.py . There is
roughly one category for each
log.WEIRD-or-higher level event in the
Tahoe source code.
The incident gatherer is created with the "
create-incident-gatherer WORKDIR" command, and started with "
start". The generated "
gatherer.tac" file should be modified to add
The incident gatherer writes incident names (which are simply the relative
pathname of the
incident-\*.flog.bz2 file) into
For example, the
file contains a list of all incidents which were triggered by an
uncoordinated write that was detected during mutable file retrieval (caused
when somebody changed the contents of the mutable file in between the node's
mapupdate step and the retrieve step). The
contains a list of all incidents that did not match any of the classification
At startup, the incident gatherer will automatically reclassify any incident
report which is not mentioned in any of the
classified/\* files. So the
usual workflow is to examine the incidents in
classified/unknown, add a
new classification function, delete
classified/unknown, then bound the
gatherer with "
tahoe restart WORKDIR". The incidents which can be
classified with the new functions will be added to their own
classified/FOO lists, and the remaining ones will be put in
classified/unknown, where the process can be repeated until all events
The incident gatherer is still fairly immature: future versions will have a web interface and an RSS feed, so operations personnel can track problems in the storage grid.
In our experience, each incident takes about two seconds to transfer from the node that generated it to the gatherer. The gatherer will automatically catch up to any incidents which occurred while it is offline.
The "Log Gatherer" subscribes to hear about every single event published by
the connected nodes, regardless of severity. This server writes these log
events into a large flogfile that is rotated (closed, compressed, and
replaced with a new one) on a periodic basis. Each flogfile is named
according to the range of time it represents, with names like
from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2". The flogfiles
contain events from many different sources, making it easier to correlate
things that happened on multiple machines (such as comparing a client node
making a request with the storage servers that respond to that request).
Create the Log Gatherer with the "
flogtool create-gatherer WORKDIR"
command, and start it with "
tahoe start". Then copy the contents of the
log_gatherer.furl file it creates into the
(under the key
log_gatherer.furl of the section
[node]) of all nodes
that should be sending it log events. (See configuration.rst.)
flogtool filter" command, described above, is useful to cut down the
potentially large flogfiles into a more focussed form.
Busy nodes, particularly web-API nodes which are performing recursive deep-size/deep-stats/deep-check operations, can produce a lot of log events. To avoid overwhelming the node (and using an unbounded amount of memory for the outbound TCP queue), publishing nodes will start dropping log events when the outbound queue grows too large. When this occurs, there will be gaps (non-sequential event numbers) in the log-gatherer's flogfiles.
Local twistd.log files
[TODO: not yet true, requires foolscap-0.3.1 and a change to
In addition to the foolscap-based event logs, certain high-level events will
be recorded directly in human-readable text form, in the
BASEDIR/logs/twistd.log file (and its rotated old versions:
twistd.log.2, etc). This form does not contain as much
information as the flogfiles available through the means described
previously, but they are immediately available to the curious developer, and
are retained until the twistd.log.NN files are explicitly deleted.
Only events at the
log.OPERATIONAL level or higher are bridged to
twistd.log (i.e. not the
log.NOISY debugging events). In addition,
foolscap internal events (like connection negotiation messages) are not
Adding log messages
When adding new code, the Tahoe developer should add a reasonable number of new log events. For details, please see the Foolscap logging documentation, but a few notes are worth stating here:
- use a facility prefix of "
tahoe.", like "
- assign each severe (
log.WEIRDor higher) event a unique message identifier, as the
umid=argument to the
misc/coding_tools/make_umidscript may be useful for this purpose. This will make it easier to write a classification function for these messages.
- use the
parent=argument whenever the event is causally/temporally clustered with its parent. For example, a download process that involves three sequential hash fetches could announce the send and receipt of those hash-fetch messages with a
parent=argument that ties them to the overall download process. However, each new web-API download request should be unparented.
- use the
format=argument in preference to the
message=argument. E.g. use
log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)instead of
log.msg("got %d shares, need %d" % (n,k)). This will allow later tools to analyze the event without needing to scrape/reconstruct the structured data out of the formatted string.
- Pass extra information as extra keyword arguments, even if they aren't
included in the
format=string. This information will be displayed in the "
flogtool dump --verbose" output, as well as being available to other tools. The
umid=argument should be passed this way.
log.errfor the catch-all
addErrbackthat gets attached to the end of any given Deferred chain. When used in conjunction with
log.err()will tell Twisted about the error-nature of the log message, causing Trial to flunk the test (with an "ERROR" indication that prints a copy of the Failure, including a traceback). Don't use
log.errfor events that are
BADbut handled (like hash failures: since these are often deliberately provoked by test code, they should not cause test failures): use
log.msg(level=BAD)for those instead.
Log Messages During Unit Tests
If a test is failing and you aren't sure why, start by enabling
FLOGTOTWISTED=1 like this:
make test FLOGTOTWISTED=1
FLOGTOTWISTED=1, sufficiently-important log events will be written
_trial_temp/test.log, which may give you more ideas about why the
test is failing. Note, however, that
_trial_temp/log.out will not receive
messages below the
level=OPERATIONAL threshold, due to this issue:
If that isn't enough, look at the detailed foolscap logging messages instead, by running the tests like this:
make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
The first environment variable will cause foolscap log events to be written
./flog.out.bz2 (instead of merely being recorded in the circular
buffers for the use of remote subscribers or incident reports). The second
will cause all log events to be written out, not just the higher-severity
ones. The third will cause twisted log events (like the markers that indicate
when each unit test is starting and stopping) to be copied into the flogfile,
making it easier to correlate log events with unit tests.
Enabling this form of logging appears to roughly double the runtime of the
unit tests. The
flog.out.bz2 file is approximately 2MB.
You can then use "
flogtool dump" or "
flogtool web-viewer" on the
flogtool tail" and the log-gatherer are not useful during unit tests,
since there is no single Tub to which all the log messages are published).
It is possible for setting these environment variables to cause spurious test failures in tests with race condition bugs. All known instances of this have been fixed as of Tahoe-LAFS v1.7.1.