Skip to content

Commit

Permalink
net-timestamp: expand documentation
Browse files Browse the repository at this point in the history
Expand Documentation/networking/timestamping.txt with new
interfaces and bytestream timestamping. Also minor
cleanup of the other text.

Import txtimestamp.c test of the new features.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
wdebruij authored and davem330 committed Sep 2, 2014
1 parent c5a6568 commit 8fe2f76
Show file tree
Hide file tree
Showing 3 changed files with 764 additions and 84 deletions.
368 changes: 286 additions & 82 deletions Documentation/networking/timestamping.txt
@@ -1,102 +1,307 @@
The existing interfaces for getting network packages time stamped are:

1. Control Interfaces

The interfaces for receiving network packages timestamps are:

* SO_TIMESTAMP
Generate time stamp for each incoming packet using the (not necessarily
monotonous!) system time. Result is returned via recv_msg() in a
control message as timeval (usec resolution).
Generates a timestamp for each incoming packet in (not necessarily
monotonic) system time. Reports the timestamp via recvmsg() in a
control message as struct timeval (usec resolution).

* SO_TIMESTAMPNS
Same time stamping mechanism as SO_TIMESTAMP, but returns result as
timespec (nsec resolution).
Same timestamping mechanism as SO_TIMESTAMP, but reports the
timestamp as struct timespec (nsec resolution).

* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
Only for multicasts: approximate send time stamp by receiving the looped
packet and using its receive time stamp.
Only for multicast:approximate transmit timestamp obtained by
reading the looped packet receive timestamp.

The following interface complements the existing ones: receive time
stamps can be generated and returned for arbitrary packets and much
closer to the point where the packet is really sent. Time stamps can
be generated in software (as before) or in hardware (if the hardware
has such a feature).
* SO_TIMESTAMPING
Generates timestamps on reception, transmission or both. Supports
multiple timestamp sources, including hardware. Supports generating
timestamps for stream sockets.

SO_TIMESTAMPING:

Instructs the socket layer which kind of information should be collected
and/or reported. The parameter is an integer with some of the following
bits set. Setting other bits is an error and doesn't change the current
state.
1.1 SO_TIMESTAMP:

Four of the bits are requests to the stack to try to generate
timestamps. Any combination of them is valid.
This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
the network stack, the feature has to be enabled for all packets. The
same is true for all early receive timestamp options.

SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware
SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software
SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware
SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software
For interface details, see `man 7 socket`.


1.2 SO_TIMESTAMPNS:

This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
timeval of SO_TIMESTAMP (ms).


1.3 SO_TIMESTAMPING:

Supports multiple types of timestamp requests. As a result, this
socket option takes a bitmap of flags, not a boolean. In

err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, &val);

val is an integer with any of the following bits set. Setting other
bit returns EINVAL and does not change the current state.

The other three bits control which timestamps will be reported in a
generated control message. If none of these bits are set or if none of
the set bits correspond to data that is available, then the control
message will not be generated:

SOF_TIMESTAMPING_SOFTWARE: report systime if available
SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available (deprecated)
SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available
1.3.1 Timestamp Generation

It is worth noting that timestamps may be collected for reasons other
than being requested by a particular socket with
SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that
can generate hardware receive timestamps ignore
SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag
in case future drivers pay attention.
Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
created packets, not to packets already in the stack. As a result, it
is possible to selectively request timestamps for a subset of packets
(e.g., for sampling) by embedding an send() call within two setsockopt
calls, one to enable timestamp generation and one to disable it.
Timestamps may also be generated for reasons other than being
requested by a particular socket, such as when receive timestamping is
enabled system wide, as explained earlier.

If timestamps are reported, they will appear in a control message with
cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like
this:
SOF_TIMESTAMPING_RX_HARDWARE:
Request rx timestamps generated by the network adapter.

SOF_TIMESTAMPING_RX_SOFTWARE:
Request rx timestamps when data enters the kernel. These timestamps
are generated just after a device driver hands a packet to the
kernel receive stack.

SOF_TIMESTAMPING_TX_HARDWARE:
Request tx timestamps generated by the network adapter.

SOF_TIMESTAMPING_TX_SOFTWARE:
Request tx timestamps when data leaves the kernel. These timestamps
are generated in the device driver as close as possible, but always
prior to, passing the packet to the network interface. Hence, they
require driver support and may not be available for all devices.

SOF_TIMESTAMPING_TX_SCHED:
Request tx timestamps prior to entering the packet scheduler. Kernel
transmit latency is, if long, often dominated by queuing delay. The
difference between this timestamp and one taken at
SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
of protocol processing. The latency incurred in protocol
processing, if any, can be computed by subtracting a userspace
timestamp taken immediately before send() from this timestamp. On
machines with virtual devices where a transmitted packet travels
through multiple devices and, hence, multiple packet schedulers,
a timestamp is generated at each layer. This allows for fine
grained measurement of queuing delay.

SOF_TIMESTAMPING_TX_ACK:
Request tx timestamps when all data in the send buffer has been
acknowledged. This only makes sense for reliable protocols. It is
currently only implemented for TCP. For that protocol, it may
over-report measurement, because the timestamp is generated when all
data up to and including the buffer at send() was acknowledged: the
cumulative acknowledgment. The mechanism ignores SACK and FACK.


1.3.2 Timestamp Reporting

The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
effect at the timestamp reporting locations in the stack. Timestamps
are only reported for packets that also have the relevant timestamp
generation request set.

SOF_TIMESTAMPING_SOFTWARE:
Report any software timestamps when available.

SOF_TIMESTAMPING_SYS_HARDWARE:
This option is deprecated and ignored.

SOF_TIMESTAMPING_RAW_HARDWARE:
Report hardware timestamps as generated by
SOF_TIMESTAMPING_TX_HARDWARE when available.


1.3.3 Timestamp Options

The interface supports one option

SOF_TIMESTAMPING_OPT_ID:

Generate a unique identifier along with each packet. A process can
have multiple concurrent timestamping requests outstanding. Packets
can be reordered in the transmit path, for instance in the packet
scheduler. In that case timestamps will be queued onto the error
queue out of order from the original send() calls. This option
embeds a counter that is incremented at send() time, to order
timestamps within a flow.

This option is implemented only for transmit timestamps. There, the
timestamp is always looped along with a struct sock_extended_err.
The option modifies field ee_info to pass an id that is unique
among all possibly concurrently outstanding timestamp requests for
that socket. In practice, it is a monotonically increasing u32
(that wraps).

In datagram sockets, the counter increments on each send call. In
stream sockets, it increments with every byte.


1.4 Bytestream Timestamps

The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
entire contents of the buffer has passed a timestamping point. That
is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
when all bytes have reached the device driver, regardless of how
many packets the data has been converted into.

In general, bytestreams have no natural delimiters and therefore
correlating a timestamp with data is non-trivial. A range of bytes
may be split across segments, any segments may be merged (possibly
coalescing sections of previously segmented buffers associated with
independent send() calls). Segments can be reordered and the same
byte range can coexist in multiple segments for protocols that
implement retransmissions.

It is essential that all timestamps implement the same semantics,
regardless of these possible transformations, as otherwise they are
incomparable. Handling "rare" corner cases differently from the
simple case (a 1:1 mapping from buffer to skb) is insufficient
because performance debugging often needs to focus on such outliers.

In practice, timestamps can be correlated with segments of a
bytestream consistently, if both semantics of the timestamp and the
timing of measurement are chosen correctly. This challenge is no
different from deciding on a strategy for IP fragmentation. There, the
definition is that only the first fragment is timestamped. For
bytestreams, we chose that a timestamp is generated only when all
bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
implement and reason about. An implementation that has to take into
account SACK would be more complex due to possible transmission holes
and out of order arrival.

On the host, TCP can also break the simple 1:1 mapping from buffer to
skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
implementation ensures correctness in all cases by tracking the
individual last byte passed to send(), even if it is no longer the
last byte after an skbuff extend or merge operation. It stores the
relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
has only one such field, only one timestamp can be generated.

In rare cases, a timestamp request can be missed if two requests are
collapsed onto the same skb. A process can detect this situation by
enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
send time with the value returned for each timestamp. It can prevent
the situation by always flushing the TCP stack in between requests,
for instance by enabling TCP_NODELAY and disabling TCP_CORK and
autocork.

These precautions ensure that the timestamp is generated only when all
bytes have passed a timestamp point, assuming that the network stack
itself does not reorder the segments. The stack indeed tries to avoid
reordering. The one exception is under administrator control: it is
possible to construct a packet scheduler configuration that delays
segments from the same stream differently. Such a setup would be
unusual.


2 Data Interfaces

Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
page (`man 7 socket`) describes how timestamps generated with
SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.


2.1 SCM_TIMESTAMPING records

These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type

struct scm_timestamping {
struct timespec systime;
struct timespec hwtimetrans;
struct timespec hwtimeraw;
struct timespec ts[3];
};

recvmsg() can be used to get this control message for regular incoming
packets. For send time stamps the outgoing packet is looped back to
the socket's error queue with the send time stamp(s) attached. It can
be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
original outgoing packet data including all headers preprended down to
and including the link layer, the scm_timestamping control message and
a sock_extended_err control message with ee_errno==ENOMSG and
ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
bounced packet is ready for reading as far as select() is concerned.
If the outgoing packet has to be fragmented, then only the first
fragment is time stamped and returned to the sending socket.

All three values correspond to the same event in time, but were
generated in different ways. Each of these values may be empty (= all
zero), in which case no such value was available. If the application
is not interested in some of these values, they can be left blank to
avoid the potential overhead of calculating them.

systime is the value of the system time at that moment. This
corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
time stamp was generated by hardware, then this field is
empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
set.

hwtimeraw is the original hardware time stamp. Filled in if
SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
relation to system time should be made.

hwtimetrans is always zero. This field is deprecated. It used to hold
hw timestamps converted to system time. Instead, expose the hardware
clock device on the NIC directly as a HW PTP clock source, to allow
time conversion in userspace and optionally synchronize system time
with a userspace PTP stack such as linuxptp. For the PTP clock API,
see Documentation/ptp/ptp.txt.


SIOCSHWTSTAMP, SIOCGHWTSTAMP:
The structure can return up to three timestamps. This is a legacy
feature. Only one field is non-zero at any time. Most timestamps
are passed in ts[0]. Hardware timestamps are passed in ts[2].

ts[1] used to hold hardware timestamps converted to system time.
Instead, expose the hardware clock device on the NIC directly as
a HW PTP clock source, to allow time conversion in userspace and
optionally synchronize system time with a userspace PTP stack such
as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt.

2.1.1 Transmit timestamps with MSG_ERRQUEUE

For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
set and with a msg_control buffer sufficiently large to receive the
relevant metadata structures. The recvmsg call returns the original
outgoing data packet with two ancillary messages attached.

A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR
embeds a struct sock_extended_err. This defines the error type. For
timestamps, the ee_errno field is ENOMSG. The other ancillary message
will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This
embeds the struct scm_timestamping.


2.1.1.2 Timestamp types

The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
type SCM_TSTAMP_* to define the actual timestamp passed in
scm_timestamping.

The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
control fields discussed previously, with one exception. For legacy
reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
is the first if ts[2] is non-zero, the second otherwise, in which
case the timestamp is stored in ts[0].


2.1.1.3 Fragmentation

Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
then only the first fragment is timestamped and returned to the sending
socket.


2.1.1.4 Packet Payload

The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
error queue mechanism is just a method to piggyback the timestamp on.
In this case, the application can choose to read datagrams with a
smaller buffer, possibly even of length 0. The payload is truncated
accordingly. Until the process calls recvmsg() on the error queue,
however, the full packet is queued, taking up budget from SO_RCVBUF.


2.1.1.5 Blocking Read

Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
POLLERR in pollfd.revents if any data is ready on the error queue.
There is no need to pass this flag in pollfd.events. This flag is
ignored on request. See also `man 2 poll`.


2.1.2 Receive timestamps

On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
on a normal recvmsg(). Since this is not a socket error, it is not
accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case,
the meaning of the three fields in struct scm_timestamping is
implicitly defined. ts[0] holds a software timestamp if set, ts[1]
is again deprecated and ts[2] holds a hardware timestamp if set.


3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP

Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
Expand Down Expand Up @@ -167,8 +372,7 @@ enum {
*/
};


DEVICE IMPLEMENTATION
3.1 Hardware Timestamping Implementation: Device Drivers

A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
Expand Down

0 comments on commit 8fe2f76

Please sign in to comment.