diff --git a/pping/README.md b/pping/README.md index e49e3cff..74c589b5 100644 --- a/pping/README.md +++ b/pping/README.md @@ -3,31 +3,99 @@ A re-implementation of [Kathie Nichols' passive ping (pping)](https://github.com/pollere/pping) utility using XDP (on ingress) and TC-BPF (on egress) for the packet capture logic. +## Why eBPF pping? +When evaluating network performance, the focus has traditionally been on +throughput. However, for many applications latency may be an equally or even +more important metric. The [bufferbloat +project](https://www.bufferbloat.net/projects/) has for many years tried to +raise awareness of the importance of network latency, and more specifically the +negative effects on QoE that an increase in latency due to oversized packet +buffers can have. Being able to measure network latency is therefore essential +for understanding network performance, and may also prove useful when +troubleshooting applications or network misconfigurations. + +The most well-known tool for measuring network latency is probably ping, which +reports a Round Trip Time (RTT) to a target node by sending a message and +measuring the time until it gets a response. Due to being standardized as part +of the ICMP protocol, ping is universally available and usually a good first +choice to determine the idle latency between two specific nodes. But the +fundamental approach of actively sending out additional network traffic to +measure network latency has several problems. + +1. It introduces additional network overhead. While a single ICMP packet every +second is negligible on most links, increasing the granularity of RTT reports +requires sending packets at a higher rate, which could add up to considerable +overhead on slower links. + +2. It only reports the RTT between a single pair of nodes. To get an overview of +the latency in a large network would require running ping between every possible +pairing, which is cumbersome and does not scale well. + +3.
It only provides the latency experienced by the ICMP echoes. This latency does +not necessarily correspond with the latency experienced by traffic from other +applications. Running ping in isolation from other traffic will likely fail to +capture an increase in latency that can be caused by bufferbloat. Even if ping +is run concurrently with other traffic, the ping traffic may be treated +differently due to, for example, active queue management, or even routed +differently because of e.g. a load balancer. + +Passive ping (pping) uses a different approach that avoids these issues. Instead +of sending out additional network traffic, pping looks at existing traffic and +reports the RTT experienced by this traffic. This means that pping adds no +network overhead, can report RTTs for any host whose regular network traffic +passes through, regardless of whether it's being run on an endhost or a +middlebox, and the reported latency corresponds to the network latency +experienced by the real application traffic. + +Kathleen Nichols proved the feasibility of this approach by implementing pping +for TCP traffic, based on the TCP timestamp option. Kathie's C++ implementation, +like most userspace programs, uses the traditional but rather inefficient +technique of copying packet headers to userspace and parsing them there. At high +line rates, copying all packet headers to userspace is very resource-demanding, +and it may not be possible for the program to keep up with the network traffic, +leading to it missing packets. + +With eBPF pping we want to leverage the power of eBPF to fix this +inefficiency. Using eBPF, the packets can be parsed directly in kernel space +while they pass through the network stack, without ever being copied to +userspace. This approach allows pping to keep up with higher line rates and +imposes less overhead.
Furthermore, eBPF pping adds some additional features, +such as JSON output, and extends support beyond TCP so it can be used to +monitor a wider range of traffic. + ## Simple description Passive Ping (PPing) is a simple tool for passively measuring per-flow RTTs. It can be used on endhosts as well as any (BPF-capable Linux) device which can see -both directions of the traffic (ex router or middlebox). Currently it only works -for TCP traffic which uses the TCP timestamp option, but could be extended to -also work with for example TCP seq/ACK numbers, the QUIC spinbit and ICMP -echo-reply messages. See the [TODO-list](./TODO.md) for more potential features -(which may or may not ever get implemented). +both directions of the traffic (ex router or middlebox). Currently it works for +TCP traffic which uses the TCP timestamp option and ICMP echo messages, but +could be extended to also work with, for example, TCP seq/ACK numbers, the QUIC +spinbit and DNS queries. See the [TODO-list](./TODO.md) for more potential +features (which may or may not ever get implemented). The fundamental logic of pping is to timestamp a pseudo-unique identifier for -outgoing packets, and then look for matches in the incoming packets. If a match -is found, the RTT is simply calculated as the time difference between the -current time and the stored timestamp. +packets, and then look for matches in the reply packets. If a match is found, +the RTT is simply calculated as the time difference between the current time and +the stored timestamp. This tool, just as Kathie's original pping implementation, uses TCP timestamps -as identifiers. For outgoing packets, the TSval (which is a timestamp in and off -itself) is timestamped. Incoming packets are then parsed for the TSecr, which -are the echoed TSval values from the receiver.
The TCP timestamps are not -necessarily unique for every packet (they have a limited update frequency, -appears to be 1000 Hz for modern Linux systems), so only the first instance of -an identifier is timestamped, and matched against the first incoming packet with -the identifier. The mechanism to ensure only the first packet is timestamped and -matched differs from the one in Kathie's pping, and is further described in +as identifiers for TCP traffic. The TSval (which is a timestamp in and of +itself) is used as an identifier and timestamped. Reply packets in the reverse +flow are then parsed for the TSecr, which are the echoed TSval values from the +receiver. The TCP timestamps are not necessarily unique for every packet (they +have a limited update frequency, which appears to be 1000 Hz for modern Linux +systems), so only the first instance of an identifier is timestamped, and +matched against the first incoming packet with a matching reply identifier. The +mechanism to ensure only the first packet is timestamped and matched differs +from the one in Kathie's pping, and is further described in [SAMPLING_DESIGN](./SAMPLING_DESIGN.md). +For ICMP echo, pping uses the echo identifier in place of port numbers, and the +echo sequence number as the identifier to match against. Linux systems will +typically use different echo identifiers for different instances of ping, and +thus each ping instance will be recognized as a separate flow. Windows systems +typically use a static echo identifier, and thus all instances of ping from a +particular Windows host to the same target host will be considered a single flow. + ## Output formats pping currently supports 3 different formats, *standard*, *ppviz* and *json*. In general, the output consists of two different types of events, flow-events which @@ -41,12 +109,12 @@ single line per event.
An example of the format is provided below: ```shell -16:00:46.142279766 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src -16:00:46.147705205 5.425439 ms 5.425439 ms 10.11.1.1:5201+10.11.1.2:59528 -16:00:47.148905125 5.261430 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528 -16:00:48.151666385 5.972284 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528 -16:00:49.152489316 6.017589 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528 -16:00:49.878508114 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest +16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest +16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528 +16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528 +16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528 +16:00:49.152489316 6.017589 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528 +16:00:49.878508114 TCP 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest ``` ### ppviz format @@ -89,7 +157,7 @@ An example of a (pretty-printed) flow-event is provided below: "protocol": "TCP", "flow_event": "opening", "reason": "SYN-ACK", - "triggered_by": "src" + "triggered_by": "dest" } ``` @@ -107,39 +175,37 @@ An example of a (pretty-printed) RTT-even is provided below: "sent_packets": 9393, "sent_bytes": 492457296, "rec_packets": 5922, - "rec_bytes": 37 + "rec_bytes": 37, + "match_on_egress": false } ``` ## Design and technical description !["Design of eBPF pping](./eBPF_pping_design.png) +eBPF pping consists of two major components, the kernel space BPF program and +the userspace program. The BPF program parses incoming and outgoing packets, and +uses BPF maps to store packet timestamps as well as some state about each +flow. When the BPF program can match a reply packet against one of the stored +packet timestamps, it pushes the calculated RTT to the userspace program which +in turn prints it out. 
+ ### Files: - **pping.c:** Userspace program that loads and attaches the BPF programs, pulls - the perf-buffer `rtt_events` to print out RTT messages and periodically cleans + the perf-buffer `events` to print out RTT messages and periodically cleans up the hash-maps from old entries. Also passes user options to the BPF programs by setting a "global variable" (stored in the programs .rodata section). -- **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and - XDP (ingress), as well as several common functions, a global constant `config` - (set from userspace) and map definitions. The tc program `pping_egress()` - parses outgoing packets for identifiers. If an identifier is found and the - sampling strategy allows it, a timestamp for the packet is created in - `packet_ts`. The XDP program `pping_ingress()` parses incomming packets for an - identifier. If found, it looks up the `packet_ts` map for a match on the - reverse flow (to match source/dest on egress). If there is a match, it - calculates the RTT from the stored timestamp and deletes the entry. The - calculated RTT (together with the flow-tuple) is pushed to the perf-buffer - `events`. Both `pping_egress()` and `pping_ingress` can also push flow-events - to the `events` buffer. -- **bpf_egress_loader.sh:** A shell script that's used by `pping.c` to setup a - clsact qdisc and attach the `pping_egress()` program to egress using - tc. **Note**: Unless your iproute2 comes with libbpf support, tc will use - iproute's own loading mechanism when loading and attaching object files - directly through the tc command line. To ensure that libbpf is always used to - load `pping_egress()`, `pping.c` actually loads the program and pins it to - `/sys/fs/bpf/pping/classifier`, and tc only attaches the pinned program. -- **functions.sh and parameters.sh:** Imported by `bpf_egress_loader.sh`. 
+- **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and + ingress (XDP or tc), as well as several common functions, a global constant + `config` (set from userspace) and map definitions. Essentially the same pping + program is loaded on both ingress and egress. All packets are parsed for both + an identifier that can be used to create a timestamp entry in `packet_ts`, and + a reply identifier that can be used to match the packet with a previously + timestamped one in the reverse flow. If a match is found, an RTT is calculated + and an RTT-event is pushed to userspace through the perf-buffer `events`. For + each packet with a valid identifier, the program also keeps track of and + updates the state of the flow and reverse flow, stored in the `flow_state` map. - **pping.h:** Common header file included by `pping.c` and `pping_kern.c`. Contains some common structs used by both (are part of the maps). @@ -147,13 +213,12 @@ An example of a (pretty-printed) RTT-even is provided below: ### BPF Maps: - **flow_state:** A hash-map storing some basic state for each flow, such as the last seen identifier for the flow and when the last timestamp entry for the - flow was created. Entries are created by `pping_egress()`, and can be updated - or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries - are eventually removed by `pping.c`. + flow was created. Entries are created, updated and deleted by the BPF pping + programs. Leftover entries are eventually removed by userspace (`pping.c`). - **packet_ts:** A hash-map storing a timestamp for a specific packet - identifier. Entries are created by `pping_egress()` and removed by - `pping_ingress()` if a match is found. Leftover entries are eventually removed - by `pping.c`. + identifier. Entries are created by the BPF pping program if a valid identifier + is found, and removed if a match is found. Leftover entries are eventually + removed by userspace (`pping.c`).
- **events:** A perf-buffer used by the BPF programs to push flow or RTT events to `pping.c`, which continuously polls the map the prints them out. @@ -204,8 +269,8 @@ these identifiers. This issue could be avoided entirely by requiring that new-id > old-id instead of simply checking that new-id != old-id, as TCP timestamps should monotonically -increase. That may however not be a suitable solution if/when we add support for -other types of identifiers. +increase. That may however not be a suitable solution for other types of +identifiers. #### Rate-limiting new timestamps In the tc/egress program packets to timestamp are sampled by using a per-flow @@ -223,9 +288,9 @@ additional map space and report some additional RTT(s) more than expected (however the reported RTTs should still be correct). If the packets have the same identifier, they must first have managed to bypass -the previous check for unique identifiers (see [previous point](#Tracking last -seen identifier)), and only one of them will be able to successfully store a -timestamp entry. +the previous check for unique identifiers (see [previous +point](#tracking-last-seen-identifier)), and only one of them will be able to +successfully store a timestamp entry. #### Matching against stored timestamps The XDP/ingress program could potentially match multiple concurrent packets with @@ -247,14 +312,28 @@ if this is the lowest RTT seen so far for the flow. If multiple RTTs are calculated concurrently, then several could pass this check concurrently and there may be a lost update. It should only be possible for multiple RTTs to be calculated concurrently in case either the [timestamp rate-limit was -bypassed](#Rate-limiting new timestamps) or [multiple packets managed to match -against the same timestamp](#Matching against stored timestamps). +bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match +against the same timestamp](#matching-against-stored-timestamps). 
It's worth noting that with sampling the reported minimum-RTT is only an estimate anyways (may never calculate RTT for packet with the true minimum RTT). And even without sampling there is some inherent sampling due to TCP timestamps only being updated at a limited rate (1000 Hz). +#### Outputting flow opening/closing events +A flow is not considered opened until a reply has been seen for it. The +`flow_state` map keeps track of whether the flow has been opened or not, and +this information is checked and updated for each reply. As the check and update +are not performed atomically, multiple replies that are processed concurrently +may each believe they are the first, resulting in multiple flow-opened events +being emitted. + +Likewise, when a flow is closed pping checks whether the flow has been opened +to determine if a flow closing message should be sent. If multiple replies are +processed concurrently, it's possible that one of them updates the flow-open +information and emits a flow opening message, while another closes the flow +without considering it opened, so no flow closing message is sent. + ## Similar projects Passively measuring the RTT for TCP traffic is not a novel concept, and there exists a number of other tools that can do so. A good overview of how passive diff --git a/pping/TODO.md b/pping/TODO.md index 1f792db6..8e592ddc 100644 --- a/pping/TODO.md +++ b/pping/TODO.md @@ -6,20 +6,23 @@ - Timestamping pure ACKs may lead to erroneous RTTs (ex. delay between application attempting to send data being recognized as an RTT) + - [x] Skip non-ACKs for ingress + - The echoed TSecr is not valid if the ACK-flag is not set - [ ] Add fallback to SEQ/ACK in case of no timestamp? - Some machines may not use TCP timestamps (either not supported at all, or disabled as in ex. Windows 10) - If one only considers SEQ/ACK (and don't check for SACK options), could result in ex.
delay from retransmission being included in RTT -- [ ] ICMP (ex Echo/Reply) +- [x] ICMP (ex Echo/Reply) - [ ] QUIC (based on spinbit) +- [ ] DNS queries ## General pping - [x] Add sampling so that RTT is not calculated for every packet (with unique value) for large flows - [ ] Allow short bursts to bypass sampling in order to handle - delayed ACKs + delayed ACKs, reordered or lost packets etc. - [x] Keep some per-flow state - Will likely be needed for the sampling - [ ] Could potentially include keeping track of average RTT, which diff --git a/pping/bpf_egress_loader.sh b/pping/bpf_egress_loader.sh deleted file mode 100755 index f15de654..00000000 --- a/pping/bpf_egress_loader.sh +++ /dev/null @@ -1,102 +0,0 @@ -#!/bin/bash -# -# Author: Jesper Dangaaard Brouer -# License: GPLv2 -# -# Modified by Simon Sundberg to add support -# of optional section (--sec) option or attaching a pinned program -# -basedir=`dirname $0` -source ${basedir}/functions.sh - -root_check_run_with_sudo "$@" - -# Use common parameters -source ${basedir}/parameters.sh - -export TC=/sbin/tc - -# This can be changed via --file or --obj -if [[ -z ${BPF_OBJ} ]]; then - # Fallback default - BPF_OBJ=pping_kern_tc.o -fi - -# This can be changed via --sec -if [[ -z ${SEC} ]]; then - # Fallback default - SEC=pping_egress -fi - -info "Applying TC-BPF egress setup on device: $DEV with object file: $BPF_OBJ" - -function tc_remove_clsact() -{ - local device=${1:-$DEV} - shift - - # Removing qdisc clsact, also deletes all filters - call_tc_allow_fail qdisc del dev "$device" clsact 2> /dev/null -} - -function tc_init_clsact() -{ - local device=${1:-$DEV} - shift - - # TODO: find method that avoids flushing (all users) - - # Also deletes all filters - call_tc_allow_fail qdisc del dev "$device" clsact 2> /dev/null - - # Load qdisc clsact which allow us to attach BPF-progs as TC filters - call_tc qdisc add dev "$device" clsact -} - -function tc_egress_bpf_attach() -{ - local device=${1:-$DEV} - local 
objfile=${2:-$BPF_OBJ} - local section=${3:-$SEC} - shift 3 - - call_tc filter add dev "$device" pref 2 handle 2 \ - egress bpf da obj "$objfile" sec "$section" -} - -function tc_egress_bpf_attach_pinned() -{ - local device=${1:-$DEV} - local pinprog=${2:-$PIN_PROG} - shift 2 - - call_tc filter add dev "$device" pref 2 handle 2 \ - egress bpf da pinned "$pinprog" -} - -function tc_egress_list() -{ - local device=${1:-$DEV} - - call_tc filter show dev "$device" egress -} - -if [[ -n $REMOVE ]]; then - tc_remove_clsact $DEV - exit 0 -fi - -tc_init_clsact $DEV - -if [[ -n $PIN_PROG ]]; then - tc_egress_bpf_attach_pinned $DEV $PIN_PROG -else - tc_egress_bpf_attach $DEV $BPF_OBJ $SEC -fi - -# Practical to list egress filters after setup. -# (It's a common mistake to have several progs loaded) -if [[ -n $LIST ]]; then - info "Listing egress filter on device" - tc_egress_list $DEV -fi diff --git a/pping/eBPF_pping_design.png b/pping/eBPF_pping_design.png index ab910020..8423b25b 100644 Binary files a/pping/eBPF_pping_design.png and b/pping/eBPF_pping_design.png differ diff --git a/pping/functions.sh b/pping/functions.sh deleted file mode 100644 index a92f4820..00000000 --- a/pping/functions.sh +++ /dev/null @@ -1,64 +0,0 @@ -# -# Common functions used by scripts in this directory -# - Depending on bash 3 (or higher) syntax -# -# Author: Jesper Dangaaard Brouer -# License: GPLv2 - -## -- sudo trick -- -function root_check_run_with_sudo() { - # Trick so, program can be run as normal user, will just use "sudo" - # call as root_check_run_as_sudo "$@" - if [ "$EUID" -ne 0 ]; then - if [ -x $0 ]; then # Directly executable use sudo - echo "# (Not root, running with sudo)" >&2 - sudo "$0" "$@" - exit $? 
- fi - echo "cannot perform sudo run of $0" - exit 1 - fi -} - -## -- General shell logging cmds -- -function err() { - local exitcode=$1 - shift - echo -e "ERROR: $@" >&2 - exit $exitcode -} - -function warn() { - echo -e "WARN : $@" >&2 -} - -function info() { - if [[ -n "$VERBOSE" ]]; then - echo "# $@" - fi -} - -## -- Wrapper calls for TC -- -function _call_tc() { - local allow_fail="$1" - shift - if [[ -n "$VERBOSE" ]]; then - echo "tc $@" - fi - if [[ -n "$DRYRUN" ]]; then - return - fi - $TC "$@" - local status=$? - if (( $status != 0 )); then - if [[ "$allow_fail" == "" ]]; then - err 3 "Exec error($status) occurred cmd: \"$TC $@\"" - fi - fi -} -function call_tc() { - _call_tc "" "$@" -} -function call_tc_allow_fail() { - _call_tc "allow_fail" "$@" -} diff --git a/pping/parameters.sh b/pping/parameters.sh deleted file mode 100644 index 1a1a49a7..00000000 --- a/pping/parameters.sh +++ /dev/null @@ -1,100 +0,0 @@ -# -# Common parameter parsing used by scripts in this directory -# - Depending on bash 3 (or higher) syntax -# -# Author: Jesper Dangaaard Brouer -# License: GPLv2 -# -# Modified by Simon Sundberg to add support -# of optional section (--sec) option or attaching a pinned program -# - -function usage() { - echo "" - echo "Usage: $0 [-vh] --dev ethX" - echo " -d | --dev : (\$DEV) Interface/device (required)" - echo " -v | --verbose : (\$VERBOSE) verbose" - echo " --remove : (\$REMOVE) Remove the rules" - echo " --dry-run : (\$DRYRUN) Dry-run only (echo tc commands)" - echo " -s | --stats : (\$STATS_ONLY) Call statistics command" - echo " -l | --list : (\$LIST) List setup after setup" - echo " --file | --obj : (\$BPF_OBJ) BPF-object file to load" - echo " --sec : (\$SEC) Section of BPF-object to load" - echo " --pinned : (\$PIN_PROG) Path to pinned program to attach" - echo "" -} - -# Using external program "getopt" to get --long-options -OPTIONS=$(getopt -o vshd:l \ - --long verbose,dry-run,remove,stats,list,help,dev:,file:,obj:,sec:,pinned: -- 
"$@") -if (( $? != 0 )); then - usage - err 2 "Error calling getopt" -fi -eval set -- "$OPTIONS" - -## --- Parse command line arguments / parameters --- -while true; do - case "$1" in - -d | --dev ) # device - export DEV=$2 - info "Device set to: DEV=$DEV" >&2 - shift 2 - ;; - --file | --obj ) - export BPF_OBJ=$2 - info "BPF-object file: $BPF_OBJ" >&2 - shift 2 - ;; - --sec ) - export SEC=$2 - info "Section to load: $SEC" >&2 - shift 2 - ;; - --pinned ) - export PIN_PROG=$2 - info "Pinned program path: $PIN_PROG" >&2 - shift 2 - ;; - -v | --verbose) - export VERBOSE=yes - # info "Verbose mode: VERBOSE=$VERBOSE" >&2 - shift - ;; - --dry-run ) - export DRYRUN=yes - export VERBOSE=yes - info "Dry-run mode: enable VERBOSE and don't call TC" >&2 - shift - ;; - --remove ) - export REMOVE=yes - shift - ;; - -s | --stats ) - export STATS_ONLY=yes - shift - ;; - -l | --list ) - export LIST=yes - shift - ;; - -- ) - shift - break - ;; - -h | --help ) - usage; - exit 0 - ;; - * ) - shift - break - ;; - esac -done - -if [ -z "$DEV" ]; then - usage - err 2 "Please specify net_device (\$DEV)" -fi diff --git a/pping/pping.c b/pping/pping.c index b8650e87..a92677b8 100644 --- a/pping/pping.c +++ b/pping/pping.c @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0-or-later */ static const char *__doc__ = - "Passive Ping - monitor flow RTT based on TCP timestamps"; + "Passive Ping - monitor flow RTT based on header inspection"; #include #include @@ -15,11 +15,8 @@ static const char *__doc__ = #include #include #include -#include #include // For detecting Ctrl-C #include // For setting rlmit -#include -#include #include #include @@ -29,8 +26,6 @@ static const char *__doc__ = #define NS_PER_SECOND 1000000000UL #define NS_PER_MS 1000000UL -#define TCBPF_LOADER_SCRIPT "./bpf_egress_loader.sh" - #define TIMESTAMP_LIFETIME \ (10 * NS_PER_SECOND) // Clear out packet timestamps if they're over 10 seconds #define FLOW_LIFETIME \ @@ -45,16 +40,16 @@ static const char *__doc__ = (1 * 
NS_PER_SECOND) // Update offset between CLOCK_MONOTONIC and CLOCK_REALTIME once per second /* - * BPF implementation of pping using libbpf - * Uses TC-BPF for egress and XDP for ingress - * - On egrees, packets are parsed for TCP TSval, - * if found added to hashmap using flow+TSval as key, - * and current time as value - * - On ingress, packets are parsed for TCP TSecr, - * if found looksup hashmap using reverse-flow+TSecr as key, - * and calculates RTT as different between now map value - * - Calculated RTTs are pushed to userspace - * (together with the related flow) and printed out + * BPF implementation of pping using libbpf. + * Uses TC-BPF for egress and XDP for ingress. + * - On egress, packets are parsed for an identifier, + * if found added to hashmap using flow+identifier as key, + * and current time as value. + * - On ingress, packets are parsed for a reply identifier, + * if found looks up hashmap using reverse-flow+identifier as key, + * and calculates RTT as difference between now and stored timestamp. + * - Calculated RTTs are pushed to userspace + * (together with the related flow) and printed out.
*/ // Structure to contain arguments for clean_map (for passing to pthread_create) @@ -67,20 +62,24 @@ struct map_cleanup_args { // Store configuration values in struct to easily pass around struct pping_config { struct bpf_config bpf_config; + struct bpf_tc_opts tc_ingress_opts; + struct bpf_tc_opts tc_egress_opts; __u64 cleanup_interval; char *object_path; char *ingress_sec; char *egress_sec; - char *pin_dir; char *packet_map; char *flow_map; char *event_map; int xdp_flags; int ifindex; + int ingress_prog_id; + int egress_prog_id; char ifname[IF_NAMESIZE]; bool json_format; bool ppviz_format; bool force; + bool created_tc_hook; }; static volatile int keep_running = 1; @@ -91,9 +90,13 @@ static const struct option long_options[] = { { "help", no_argument, NULL, 'h' }, { "interface", required_argument, NULL, 'i' }, // Name of interface to run on { "rate-limit", required_argument, NULL, 'r' }, // Sampling rate-limit in ms - { "force", no_argument, NULL, 'f' }, // Detach any existing XDP program on interface + { "rtt-rate", required_argument, NULL, 'R' }, // Sampling rate in terms of flow-RTT (ex 1 sample per RTT-interval) + { "rtt-type", required_argument, NULL, 'T' }, // What type of RTT the RTT-rate should be applied to ("min" or "smoothed"), only relevant if rtt-rate is provided + { "force", no_argument, NULL, 'f' }, // Overwrite any existing XDP program on interface, remove qdisc on cleanup { "cleanup-interval", required_argument, NULL, 'c' }, // Map cleaning interval in s { "format", required_argument, NULL, 'F' }, // Which format to output in (standard/json/ppviz) + { "ingress-hook", required_argument, NULL, 'I' }, // Use tc or XDP as ingress hook + { "localfilt-off", no_argument, NULL, 'l' }, // Disable local filtering (will start to report "internal" RTTs) { 0, 0, NULL, 0 } }; @@ -120,35 +123,50 @@ static void print_usage(char *argv[]) printf("\n"); } -static double parse_positive_double_argument(const char *str, - const char *parname) +/* + * Simple 
convenience wrapper around libbpf_strerror for which you don't have + * to provide a buffer. Instead uses its own static buffer and returns a pointer + * to it. + * + * This of course comes with the tradeoff that it is no longer thread safe and + * later invocations overwrite previous results. + */ +static const char *get_libbpf_strerror(int err) +{ + static char buf[200]; + libbpf_strerror(err, buf, sizeof(buf)); + return buf; +} + +static int parse_bounded_double(double *res, const char *str, double low, + double high, const char *name) { char *endptr; - double val; - val = strtod(str, &endptr); + *res = strtod(str, &endptr); if (strlen(str) != endptr - str) { - fprintf(stderr, "%s %s is not a valid number\n", parname, str); + fprintf(stderr, "%s %s is not a valid number\n", name, str); return -EINVAL; } - if (val < 0) { - fprintf(stderr, "%s must be positive\n", parname); + if (*res < low || *res > high) { + fprintf(stderr, "%s must be in range [%g, %g]\n", name, low, high); return -EINVAL; } - return val; + return 0; } static int parse_arguments(int argc, char *argv[], struct pping_config *config) { int err, opt; - double rate_limit_ms, cleanup_interval_s; + double rate_limit_ms, cleanup_interval_s, rtt_rate; config->ifindex = 0; + config->bpf_config.localfilt = true; config->force = false; config->json_format = false; config->ppviz_format = false; - while ((opt = getopt_long(argc, argv, "hfi:r:c:F:", long_options, + while ((opt = getopt_long(argc, argv, "hfli:r:R:T:c:F:I:", long_options, NULL)) != -1) { switch (opt) { case 'i': @@ -163,23 +181,44 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config) err = -errno; fprintf(stderr, "Could not get index of interface %s: %s\n", - config->ifname, strerror(err)); + config->ifname, get_libbpf_strerror(err)); return err; } break; case 'r': - rate_limit_ms = parse_positive_double_argument( - optarg, "rate-limit"); - if (rate_limit_ms < 0) + err = parse_bounded_double(&rate_limit_ms, optarg, 0,
+ 1000000000, "rate-limit"); + if (err) return -EINVAL; config->bpf_config.rate_limit = rate_limit_ms * NS_PER_MS; break; + case 'R': + err = parse_bounded_double(&rtt_rate, optarg, 0, 10000, + "rtt-rate"); + if (err) + return -EINVAL; + config->bpf_config.rtt_rate = + DOUBLE_TO_FIXPOINT(rtt_rate); + break; + case 'T': + if (strcmp(optarg, "min") == 0) { + config->bpf_config.use_srtt = false; + } + else if (strcmp(optarg, "smoothed") == 0) { + config->bpf_config.use_srtt = true; + } else { + fprintf(stderr, + "rtt-type must be \"min\" or \"smoothed\"\n"); + return -EINVAL; + } + break; case 'c': - cleanup_interval_s = parse_positive_double_argument( - optarg, "cleanup-interval"); - if (cleanup_interval_s < 0) + err = parse_bounded_double(&cleanup_interval_s, optarg, + 0, 1000000000, + "cleanup-interval"); + if (err) return -EINVAL; config->cleanup_interval = @@ -195,8 +234,22 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config) return -EINVAL; } break; + case 'I': + if (strcmp(optarg, "xdp") == 0) { + config->ingress_sec = SEC_INGRESS_XDP; + } else if (strcmp(optarg, "tc") == 0) { + config->ingress_sec = SEC_INGRESS_TC; + } else { + fprintf(stderr, "ingress-hook must be \"xdp\" or \"tc\"\n"); + return -EINVAL; + } + break; + case 'l': + config->bpf_config.localfilt = false; + break; case 'f': config->force = true; + config->xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; break; case 'h': printf("HELP:\n"); @@ -232,60 +285,29 @@ static int set_rlimit(long int lim) return !setrlimit(RLIMIT_MEMLOCK, &rlim) ? 
0 : -errno; } -static int -bpf_obj_run_prog_pindir_func(struct bpf_object *obj, const char *prog_title, - const char *pin_dir, - int (*func)(struct bpf_program *, const char *)) +static int init_rodata(struct bpf_object *obj, void *src, size_t size) { - int len; - struct bpf_program *prog; - char path[MAX_PATH_LEN]; - - len = snprintf(path, MAX_PATH_LEN, "%s/%s", pin_dir, prog_title); - if (len < 0) - return len; - if (len > MAX_PATH_LEN) - return -ENAMETOOLONG; - - prog = bpf_object__find_program_by_title(obj, prog_title); - if (!prog || libbpf_get_error(prog)) - return prog ? libbpf_get_error(prog) : -EINVAL; - - return func(prog, path); -} + struct bpf_map *map = NULL; + bpf_object__for_each_map(map, obj) { + if (strstr(bpf_map__name(map), ".rodata")) + return bpf_map__set_initial_value(map, src, size); + } -/* - * Similar to bpf_object__pin_programs, but only attemps to pin a - * single program prog_title at path pin_dir/prog_title - */ -static int bpf_obj_pin_program(struct bpf_object *obj, const char *prog_title, - const char *pin_dir) -{ - return bpf_obj_run_prog_pindir_func(obj, prog_title, pin_dir, - bpf_program__pin); + // No .rodata map found + return -EINVAL; } /* - * Similar to bpf_object__unpin_programs, but only attempts to unpin a - * single program prog_title at path pin_dir/prog_title. + * Attempt to attach program in section sec of obj to ifindex. + * If successful, will return the positive program id of the attached program. + * On failure, will return a negative error code.
*/ -static int bpf_obj_unpin_program(struct bpf_object *obj, const char *prog_title, - const char *pin_dir) -{ - return bpf_obj_run_prog_pindir_func(obj, prog_title, pin_dir, - bpf_program__unpin); -} - -static int xdp_detach(int ifindex, __u32 xdp_flags) -{ - return bpf_set_link_xdp_fd(ifindex, -1, xdp_flags); -} - static int xdp_attach(struct bpf_object *obj, const char *sec, int ifindex, - __u32 xdp_flags, bool force) + __u32 xdp_flags) { struct bpf_program *prog; - int prog_fd; + int prog_fd, err; + __u32 prog_id; if (sec) prog = bpf_object__find_program_by_title(obj, sec); @@ -296,66 +318,131 @@ static int xdp_attach(struct bpf_object *obj, const char *sec, int ifindex, if (prog_fd < 0) return prog_fd; - if (force) // detach current (if any) xdp-program first - xdp_detach(ifindex, xdp_flags); - - return bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags); -} + err = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags); + if (err) + return err; -static int init_rodata(struct bpf_object *obj, void *src, size_t size) -{ - struct bpf_map *map = NULL; - bpf_object__for_each_map(map, obj) { - if (strstr(bpf_map__name(map), ".rodata")) - return bpf_map__set_initial_value(map, src, size); + err = bpf_get_link_xdp_id(ifindex, &prog_id, xdp_flags); + if (err) { + bpf_set_link_xdp_fd(ifindex, -1, xdp_flags); + return err; } - // No .rodata map found - return -EINVAL; + return prog_id; } -static int run_external_program(const char *path, char *const argv[]) +static int xdp_detach(int ifindex, __u32 xdp_flags, __u32 expected_prog_id) { - int status; - int ret = -1; + __u32 curr_prog_id; + int err; - pid_t pid = fork(); + err = bpf_get_link_xdp_id(ifindex, &curr_prog_id, xdp_flags); + if (err) + return err; - if (pid < 0) - return -errno; - if (pid == 0) { - execv(path, argv); - return -errno; - } else { //pid > 0 - waitpid(pid, &status, 0); - if (WIFEXITED(status)) - ret = WEXITSTATUS(status); - return ret; + if (!curr_prog_id) { + return 0; // No current prog on interface } + + 
if (expected_prog_id && curr_prog_id != expected_prog_id) + return -ENOENT; + + return bpf_set_link_xdp_fd(ifindex, -1, xdp_flags); } -static int tc_bpf_attach(const char *pin_dir, const char *section, - char *interface) +/* + * Will attempt to attach program at section sec in obj to ifindex at + * attach_point. + * On success, will fill in the passed opts, optionally set new_hook depending on + * whether it created a new hook or not, and return the id of the attached program. + * On failure it will return a negative error code. + */ +static int tc_attach(struct bpf_object *obj, int ifindex, + enum bpf_tc_attach_point attach_point, + const char *sec, struct bpf_tc_opts *opts, + bool *new_hook) { - char prog_path[MAX_PATH_LEN]; - char *const argv[] = { TCBPF_LOADER_SCRIPT, "--dev", interface, - "--pinned", prog_path, NULL }; + int err; + int prog_fd; + bool created_hook = true; + DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex, + .attach_point = attach_point); + + err = bpf_tc_hook_create(&hook); + if (err == -EEXIST) + created_hook = false; + else if (err) + return err; - if (snprintf(prog_path, sizeof(prog_path), "%s/%s", pin_dir, section) < 0) - return -EINVAL; + prog_fd = bpf_program__fd( + bpf_object__find_program_by_title(obj, sec)); + if (prog_fd < 0) { + err = prog_fd; + goto err_after_hook; + } - return run_external_program(TCBPF_LOADER_SCRIPT, argv); + opts->prog_fd = prog_fd; + opts->prog_id = 0; + err = bpf_tc_attach(&hook, opts); + if (err) + goto err_after_hook; + + if (new_hook) + *new_hook = created_hook; + return opts->prog_id; + +err_after_hook: + /* + * Destroy the hook if we created it. + * This is slightly racy, as some other program may still have been + * attached to the hook between its creation and this error cleanup.
+ */ + if (created_hook) { + hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS; + bpf_tc_hook_destroy(&hook); + } + return err; } -static int tc_bpf_clear(char *interface) +static int tc_detach(int ifindex, enum bpf_tc_attach_point attach_point, + const struct bpf_tc_opts *opts, bool destroy_hook) { - char *const argv[] = { TCBPF_LOADER_SCRIPT, "--dev", interface, - "--remove", NULL }; - return run_external_program(TCBPF_LOADER_SCRIPT, argv); + int err; + int hook_err = 0; + DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex, + .attach_point = attach_point); + DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts_info, .handle = opts->handle, + .priority = opts->priority); + + // Check we are removing the correct program + err = bpf_tc_query(&hook, &opts_info); + if (err) + return err; + if (opts->prog_id != opts_info.prog_id) + return -ENOENT; + + // Attempt to detach program + opts_info.prog_fd = 0; + opts_info.prog_id = 0; + opts_info.flags = 0; + err = bpf_tc_detach(&hook, &opts_info); + + /* + * Attempt to destroy hook regardless of whether detach succeeded. + * If the hook is destroyed successfully, the program should + * also be detached. + */ + if (destroy_hook) { + hook.attach_point = BPF_TC_INGRESS | BPF_TC_EGRESS; + hook_err = bpf_tc_hook_destroy(&hook); + } + + err = destroy_hook ? hook_err : err; + return err; } /* - * Returns time of CLOCK_MONOTONIC as nanoseconds in a single __u64. + * Returns time as nanoseconds in a single __u64. * On failure, the value 0 is returned (and errno will be set).
*/ static __u64 get_time_ns(clockid_t clockid) @@ -378,15 +465,16 @@ static bool packet_ts_timeout(void *key_ptr, void *val_ptr, __u64 now) static bool flow_timeout(void *key_ptr, void *val_ptr, __u64 now) { struct flow_event fe; - __u64 ts = ((struct flow_state *)val_ptr)->last_timestamp; + struct flow_state *f_state = val_ptr; - if (now > ts && now - ts > FLOW_LIFETIME) { - if (print_event_func) { + if (now > f_state->last_timestamp && + now - f_state->last_timestamp > FLOW_LIFETIME) { + if (print_event_func && f_state->has_opened) { fe.event_type = EVENT_TYPE_FLOW; fe.timestamp = now; - memcpy(&fe.flow, key_ptr, sizeof(struct network_tuple)); - fe.event_info.event = FLOW_EVENT_CLOSING; - fe.event_info.reason = EVENT_REASON_FLOW_TIMEOUT; + reverse_flow(&fe.flow, key_ptr); + fe.flow_event_type = FLOW_EVENT_CLOSING; + fe.reason = EVENT_REASON_FLOW_TIMEOUT; fe.source = EVENT_SOURCE_USERSPACE; print_event_func(NULL, 0, &fe, sizeof(fe)); } @@ -537,6 +625,7 @@ static const char *flowevent_to_str(enum flow_event_type fe) case FLOW_EVENT_OPENING: return "opening"; case FLOW_EVENT_CLOSING: + case FLOW_EVENT_CLOSING_BOTH: return "closing"; default: return "unknown"; @@ -554,8 +643,6 @@ static const char *eventreason_to_str(enum flow_event_reason er) return "first observed packet"; case EVENT_REASON_FIN: return "FIN"; - case EVENT_REASON_FIN_ACK: - return "FIN-ACK"; case EVENT_REASON_RST: return "RST"; case EVENT_REASON_FLOW_TIMEOUT: @@ -568,9 +655,9 @@ static const char *eventreason_to_str(enum flow_event_reason er) static const char *eventsource_to_str(enum flow_event_source es) { switch (es) { - case EVENT_SOURCE_EGRESS: + case EVENT_SOURCE_PKT_SRC: return "src"; - case EVENT_SOURCE_INGRESS: + case EVENT_SOURCE_PKT_DEST: return "dest"; case EVENT_SOURCE_USERSPACE: return "userspace-cleanup"; @@ -607,20 +694,21 @@ static void print_event_standard(void *ctx, int cpu, void *data, if (e->event_type == EVENT_TYPE_RTT) { print_ns_datetime(stdout, e->rtt_event.timestamp); - 
printf(" %llu.%06llu ms %llu.%06llu ms ", + printf(" %llu.%06llu ms %llu.%06llu ms %s ", e->rtt_event.rtt / NS_PER_MS, e->rtt_event.rtt % NS_PER_MS, e->rtt_event.min_rtt / NS_PER_MS, - e->rtt_event.min_rtt % NS_PER_MS); + e->rtt_event.min_rtt % NS_PER_MS, + proto_to_str(e->rtt_event.flow.proto)); print_flow_ppvizformat(stdout, &e->rtt_event.flow); printf("\n"); } else if (e->event_type == EVENT_TYPE_FLOW) { print_ns_datetime(stdout, e->flow_event.timestamp); - printf(" "); + printf(" %s ", proto_to_str(e->rtt_event.flow.proto)); print_flow_ppvizformat(stdout, &e->flow_event.flow); printf(" %s due to %s from %s\n", - flowevent_to_str(e->flow_event.event_info.event), - eventreason_to_str(e->flow_event.event_info.reason), + flowevent_to_str(e->flow_event.flow_event_type), + eventreason_to_str(e->flow_event.reason), eventsource_to_str(e->flow_event.source)); } } @@ -630,6 +718,7 @@ static void print_event_ppviz(void *ctx, int cpu, void *data, __u32 data_size) const struct rtt_event *e = data; __u64 time = convert_monotonic_to_realtime(e->timestamp); + // ppviz format does not support flow events if (e->event_type != EVENT_TYPE_RTT) return; @@ -668,18 +757,20 @@ static void print_rttevent_fields_json(json_writer_t *ctx, jsonw_u64_field(ctx, "sent_bytes", re->sent_bytes); jsonw_u64_field(ctx, "rec_packets", re->rec_pkts); jsonw_u64_field(ctx, "rec_bytes", re->rec_bytes); + jsonw_bool_field(ctx, "match_on_egress", re->match_on_egress); } static void print_flowevent_fields_json(json_writer_t *ctx, const struct flow_event *fe) { jsonw_string_field(ctx, "flow_event", - flowevent_to_str(fe->event_info.event)); + flowevent_to_str(fe->flow_event_type)); jsonw_string_field(ctx, "reason", - eventreason_to_str(fe->event_info.reason)); + eventreason_to_str(fe->reason)); jsonw_string_field(ctx, "triggered_by", eventsource_to_str(fe->source)); } +// TODO - add field noting if RTT includes "internal" delays or not static void print_event_json(void *ctx, int cpu, void *data, __u32 
data_size) { const union pping_event *e = data; @@ -706,18 +797,37 @@ static void handle_missed_rtt_event(void *ctx, int cpu, __u64 lost_cnt) fprintf(stderr, "Lost %llu RTT events on CPU %d\n", lost_cnt, cpu); } +/* + * Print out some hints for what might have caused an error while attempting + * to attach an XDP program. Based on xdp_link_attach() in + * xdp-tutorial/common/common_user_bpf_xdp.c + */ +static void print_xdp_error_hints(FILE *stream, int err) +{ + err = err > 0 ? err : -err; + switch (err) { + case EBUSY: + case EEXIST: + fprintf(stream, "Hint: XDP already loaded on device" + " use --force to swap/replace\n"); + break; + case EOPNOTSUPP: + fprintf(stream, "Hint: Native-XDP not supported\n"); + break; + } +} + static int load_attach_bpfprogs(struct bpf_object **obj, - struct pping_config *config, bool *tc_attached, - bool *xdp_attached) + struct pping_config *config) { - int err; + int err, detach_err; // Open and load ELF file *obj = bpf_object__open(config->object_path); err = libbpf_get_error(*obj); if (err) { fprintf(stderr, "Failed opening object file %s: %s\n", - config->object_path, strerror(-err)); + config->object_path, get_libbpf_strerror(err)); return err; } @@ -725,48 +835,60 @@ static int load_attach_bpfprogs(struct bpf_object **obj, sizeof(config->bpf_config)); if (err) { fprintf(stderr, "Failed pushing user-configration to %s: %s\n", - config->object_path, strerror(-err)); + config->object_path, get_libbpf_strerror(err)); return err; } err = bpf_object__load(*obj); if (err) { fprintf(stderr, "Failed loading bpf program in %s: %s\n", - config->object_path, strerror(-err)); + config->object_path, get_libbpf_strerror(err)); return err; } - // Attach tc program - err = bpf_obj_pin_program(*obj, config->egress_sec, config->pin_dir); - if (err) { - fprintf(stderr, "Failed pinning tc program to %s/%s: %s\n", - config->pin_dir, config->egress_sec, strerror(-err)); - return err; - } - - err = tc_bpf_attach(config->pin_dir, config->egress_sec, - 
config->ifname); - if (err) { + // Attach egress prog + config->egress_prog_id = + tc_attach(*obj, config->ifindex, BPF_TC_EGRESS, + config->egress_sec, &config->tc_egress_opts, + &config->created_tc_hook); + if (config->egress_prog_id < 0) { fprintf(stderr, - "Failed attaching tc program on interface %s: %s\n", - config->ifname, strerror(-err)); - return err; + "Failed attaching egress BPF program on interface %s: %s\n", + config->ifname, + get_libbpf_strerror(config->egress_prog_id)); + return config->egress_prog_id; } - *tc_attached = true; - // Attach XDP program - err = xdp_attach(*obj, config->ingress_sec, config->ifindex, - config->xdp_flags, config->force); - if (err) { - fprintf(stderr, "Failed attaching XDP program to %s%s: %s\n", - config->ifname, - config->force ? "" : ", ensure no other XDP program is already running on interface", - strerror(-err)); - return err; + // Attach ingress prog + if (strcmp(config->ingress_sec, SEC_INGRESS_XDP) == 0) + config->ingress_prog_id = + xdp_attach(*obj, config->ingress_sec, config->ifindex, + config->xdp_flags); + else + config->ingress_prog_id = + tc_attach(*obj, config->ifindex, BPF_TC_INGRESS, + config->ingress_sec, &config->tc_ingress_opts, + NULL); + if (config->ingress_prog_id < 0) { + err = config->ingress_prog_id; + fprintf(stderr, + "Failed attaching ingress BPF program on interface %s: %s\n", + config->ifname, get_libbpf_strerror(err)); + if (strcmp(config->ingress_sec, SEC_INGRESS_XDP) == 0) + print_xdp_error_hints(stderr, err); + goto ingress_err; } - *xdp_attached = true; return 0; + +ingress_err: + detach_err = + tc_detach(config->ifindex, BPF_TC_EGRESS, + &config->tc_egress_opts, config->created_tc_hook); + if (detach_err) + fprintf(stderr, "Failed detaching tc program from %s: %s\n", + config->ifname, get_libbpf_strerror(detach_err)); + return err; } static int setup_periodical_map_cleaning(struct bpf_object *obj,
if (clean_args.packet_map_fd < 0) { fprintf(stderr, "Could not get file descriptor of map %s: %s\n", config->packet_map, - strerror(-clean_args.packet_map_fd)); + get_libbpf_strerror(clean_args.packet_map_fd)); return clean_args.packet_map_fd; } @@ -791,15 +913,16 @@ static int setup_periodical_map_cleaning(struct bpf_object *obj, bpf_object__find_map_fd_by_name(obj, config->flow_map); if (clean_args.flow_map_fd < 0) { fprintf(stderr, "Could not get file descriptor of map %s: %s\n", - config->flow_map, strerror(-clean_args.flow_map_fd)); - return clean_args.packet_map_fd; + config->flow_map, + get_libbpf_strerror(clean_args.flow_map_fd)); + return clean_args.flow_map_fd; } err = pthread_create(&tid, NULL, periodic_map_cleanup, &clean_args); if (err) { fprintf(stderr, "Failed starting thread to perform periodic map cleanup: %s\n", - strerror(-err)); + get_libbpf_strerror(err)); return err; } @@ -808,30 +931,31 @@ static int setup_periodical_map_cleaning(struct bpf_object *obj, int main(int argc, char *argv[]) { - int err = 0; - - bool tc_attached = false; - bool xdp_attached = false; - + int err = 0, detach_err; struct bpf_object *obj = NULL; + struct perf_buffer *pb = NULL; + struct perf_buffer_opts pb_opts = { + .sample_cb = print_event_standard, + .lost_cb = handle_missed_rtt_event, + }; + + DECLARE_LIBBPF_OPTS(bpf_tc_opts, tc_ingress_opts); + DECLARE_LIBBPF_OPTS(bpf_tc_opts, tc_egress_opts); struct pping_config config = { - .bpf_config = { .rate_limit = 100 * NS_PER_MS }, + .bpf_config = { .rate_limit = 100 * NS_PER_MS, + .rtt_rate = 0, + .use_srtt = false }, .cleanup_interval = 1 * NS_PER_SECOND, .object_path = "pping_kern.o", - .ingress_sec = INGRESS_PROG_SEC, - .egress_sec = EGRESS_PROG_SEC, - .pin_dir = "/sys/fs/bpf/pping", + .ingress_sec = SEC_INGRESS_XDP, + .egress_sec = SEC_EGRESS_TC, .packet_map = "packet_ts", .flow_map = "flow_state", .event_map = "events", .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST, - }; - - struct perf_buffer *pb = NULL; - struct 
perf_buffer_opts pb_opts = { - .sample_cb = print_event_standard, - .lost_cb = handle_missed_rtt_event, + .tc_ingress_opts = tc_ingress_opts, + .tc_egress_opts = tc_egress_opts, }; print_event_func = print_event_standard; @@ -846,14 +970,14 @@ int main(int argc, char *argv[]) err = set_rlimit(RLIM_INFINITY); if (err) { fprintf(stderr, "Could not set rlimit to infinity: %s\n", - strerror(-err)); + get_libbpf_strerror(err)); return EXIT_FAILURE; } err = parse_arguments(argc, argv, &config); if (err) { fprintf(stderr, "Failed parsing arguments: %s\n", - strerror(-err)); + get_libbpf_strerror(err)); print_usage(argv); return EXIT_FAILURE; } @@ -866,19 +990,19 @@ int main(int argc, char *argv[]) print_event_func = print_event_ppviz; } - err = load_attach_bpfprogs(&obj, &config, &tc_attached, &xdp_attached); + err = load_attach_bpfprogs(&obj, &config); if (err) { fprintf(stderr, "Failed loading and attaching BPF programs in %s\n", config.object_path); - goto cleanup; + return EXIT_FAILURE; } err = setup_periodical_map_cleaning(obj, &config); if (err) { fprintf(stderr, "Failed setting up map cleaning: %s\n", - strerror(-err)); - goto cleanup; + get_libbpf_strerror(err)); + goto cleanup_attached_progs; } // Set up perf buffer @@ -887,10 +1011,9 @@ int main(int argc, char *argv[]) PERF_BUFFER_PAGES, &pb_opts); err = libbpf_get_error(pb); if (err) { - pb = NULL; fprintf(stderr, "Failed to open perf buffer %s: %s\n", - config.event_map, strerror(err)); - goto cleanup; + config.event_map, get_libbpf_strerror(err)); + goto cleanup_attached_progs; } // Allow program to perform cleanup on Ctrl-C @@ -902,43 +1025,38 @@ int main(int argc, char *argv[]) if (keep_running) // Only print polling error if it wasn't caused by program termination fprintf(stderr, "Error polling perf buffer: %s\n", - strerror(-err)); + get_libbpf_strerror(-err)); break; } } -cleanup: - perf_buffer__free(pb); - - if (xdp_attached) { - err = xdp_detach(config.ifindex, config.xdp_flags); - if (err) - 
fprintf(stderr, - "Failed deatching program from ifindex %s: %s\n", - config.ifname, strerror(-err)); - } - - if (tc_attached) { - err = tc_bpf_clear(config.ifname); - if (err) - fprintf(stderr, - "Failed removing tc-bpf program from interface %s: %s\n", - config.ifname, strerror(-err)); - } - - if (obj && !libbpf_get_error(obj)) { - err = bpf_obj_unpin_program(obj, config.egress_sec, - config.pin_dir); - if (err) - fprintf(stderr, - "Failed unpinning tc program from %s: %s\n", - config.pin_dir, strerror(-err)); - } - + // Cleanup if (config.json_format && json_ctx) { jsonw_end_array(json_ctx); jsonw_destroy(&json_ctx); } - return err != 0; + perf_buffer__free(pb); + +cleanup_attached_progs: + if (strcmp(config.ingress_sec, SEC_INGRESS_XDP) == 0) + detach_err = xdp_detach(config.ifindex, config.xdp_flags, + config.ingress_prog_id); + else + detach_err = tc_detach(config.ifindex, BPF_TC_INGRESS, + &config.tc_ingress_opts, false); + if (detach_err) + fprintf(stderr, + "Failed removing ingress program from interface %s: %s\n", + config.ifname, get_libbpf_strerror(detach_err)); + + detach_err = + tc_detach(config.ifindex, BPF_TC_EGRESS, &config.tc_egress_opts, + config.force && config.created_tc_hook); + if (detach_err) + fprintf(stderr, + "Failed removing egress program from interface %s: %s\n", + config.ifname, get_libbpf_strerror(detach_err)); + + return (err != 0 && keep_running) || detach_err != 0; } diff --git a/pping/pping.h b/pping/pping.h index 622570f6..3703e8c1 100644 --- a/pping/pping.h +++ b/pping/pping.h @@ -6,8 +6,14 @@ #include #include -#define INGRESS_PROG_SEC "xdp" -#define EGRESS_PROG_SEC "classifier" +#define SEC_INGRESS_XDP "xdp" +#define SEC_INGRESS_TC "classifier/ingress" +#define SEC_EGRESS_TC "classifier/egress" + +typedef __u64 fixpoint64; +#define FIXPOINT_SHIFT 16 +#define DOUBLE_TO_FIXPOINT(X) ((fixpoint64)((X) * (1UL << FIXPOINT_SHIFT))) +#define FIXPOINT_TO_UINT(X) ((X) >> FIXPOINT_SHIFT) /* For the event_type members of rtt_event and 
flow_event */ #define EVENT_TYPE_FLOW 1 @@ -16,7 +22,8 @@ enum __attribute__((__packed__)) flow_event_type { FLOW_EVENT_NONE, FLOW_EVENT_OPENING, - FLOW_EVENT_CLOSING + FLOW_EVENT_CLOSING, + FLOW_EVENT_CLOSING_BOTH }; enum __attribute__((__packed__)) flow_event_reason { @@ -24,19 +31,22 @@ enum __attribute__((__packed__)) flow_event_reason { EVENT_REASON_SYN_ACK, EVENT_REASON_FIRST_OBS_PCKT, EVENT_REASON_FIN, - EVENT_REASON_FIN_ACK, EVENT_REASON_RST, EVENT_REASON_FLOW_TIMEOUT }; enum __attribute__((__packed__)) flow_event_source { - EVENT_SOURCE_EGRESS, - EVENT_SOURCE_INGRESS, + EVENT_SOURCE_PKT_SRC, + EVENT_SOURCE_PKT_DEST, EVENT_SOURCE_USERSPACE }; struct bpf_config { __u64 rate_limit; + fixpoint64 rtt_rate; + bool use_srtt; + bool localfilt; + __u8 reserved[6]; }; /* @@ -67,13 +77,16 @@ struct network_tuple { struct flow_state { __u64 min_rtt; + __u64 srtt; __u64 last_timestamp; __u64 sent_pkts; __u64 sent_bytes; __u64 rec_pkts; __u64 rec_bytes; __u32 last_id; - __u32 reserved; + bool has_opened; + enum flow_event_reason opening_reason; + __u16 reserved; }; struct packet_id { @@ -100,13 +113,14 @@ struct rtt_event { __u64 sent_bytes; __u64 rec_pkts; __u64 rec_bytes; - __u32 reserved; + bool match_on_egress; + __u8 reserved[7]; }; -struct flow_event_info { - enum flow_event_type event; - enum flow_event_reason reason; -}; +/* struct flow_event_info { */ +/* enum flow_event_type event; */ +/* enum flow_event_reason reason; */ +/* }; */ /* * A flow event message that can be passed from the bpf-programs to user-space. 
@@ -118,7 +132,8 @@ struct flow_event { __u64 event_type; __u64 timestamp; struct network_tuple flow; - struct flow_event_info event_info; + enum flow_event_type flow_event_type; + enum flow_event_reason reason; enum flow_event_source source; __u8 reserved; }; @@ -129,4 +144,16 @@ union pping_event { struct flow_event flow_event; }; +/* + * Copies src to dest, but with saddr and daddr swapped + */ +static void reverse_flow(struct network_tuple *dest, struct network_tuple *src) +{ + dest->ipv = src->ipv; + dest->proto = src->proto; + dest->saddr = src->daddr; + dest->daddr = src->saddr; + dest->reserved = 0; +} + #endif diff --git a/pping/pping_kern.c b/pping/pping_kern.c index 34a56ddd..9cf2d2bf 100644 --- a/pping/pping_kern.c +++ b/pping/pping_kern.c @@ -1,12 +1,15 @@ /* SPDX-License-Identifier: GPL-2.0-or-later */ #include #include +#include #include #include #include #include #include #include +#include +#include #include // overwrite xdp/parsing_helpers.h value to avoid hitting verifier limit @@ -22,6 +25,8 @@ #define AF_INET6 10 #define MAX_TCP_OPTIONS 10 +#define IPV6_FLOWINFO_MASK __cpu_to_be32(0x0FFFFFFF) + /* * This struct keeps track of the data and data_end pointers from the xdp_md or * __skb_buff contexts, as well as a currently parsed to position kept in nh. @@ -30,11 +35,43 @@ * header encloses. */ struct parsing_context { - void *data; //Start of eth hdr - void *data_end; //End of safe acessible area - struct hdr_cursor nh; //Position to parse next - __u32 pkt_len; //Full packet length (headers+data) - bool is_egress; //Is packet on egress or ingress? + void *data; // Start of eth hdr + void *data_end; // End of safe accessible area + struct hdr_cursor nh; // Position to parse next + __u32 pkt_len; // Full packet length (headers+data) + __u32 ifindex; // Interface packet arrived on + bool is_egress; // Is packet on egress or ingress? +}; + +/* + * Struct filled in by parse_packet_identifier.
+ * + * Note: As long as parse_packet_identifier is successful, the flow-parts of pid + * and reply_pid should be valid, regardless of the values of pid_valid and + * reply_pid_valid. The *pid_valid members are there to indicate that the + * identifier part of *pid is valid and can be used for timestamping/lookup. + * The reason for not keeping the flow parts as entirely separate members + * is to save some performance by avoiding a copy for lookup/insertion + * in the packet_ts map. + */ +struct packet_info { + union { + struct iphdr *iph; + struct ipv6hdr *ip6h; + }; + union { + struct icmphdr *icmph; + struct icmp6hdr *icmp6h; + struct tcphdr *tcph; + }; + __u64 time; // Arrival time of packet + __u32 payload; // Size of packet data (excluding headers) + struct packet_id pid; // identifier to timestamp (ex. TSval) + struct packet_id reply_pid; // identifier to match against (ex. TSecr) + bool pid_valid; // identifier can be used to timestamp packet + bool reply_pid_valid; // reply_identifier can be used to match packet + enum flow_event_type event_type; // flow event triggered by packet + enum flow_event_reason event_reason; // reason for triggering flow event }; char _license[] SEC("license") = "GPL"; @@ -67,13 +104,24 @@ struct { /* * Maps an IPv4 address into an IPv6 address according to RFC 4291 sec 2.5.5.2 */ -static void map_ipv4_to_ipv6(__be32 ipv4, struct in6_addr *ipv6) +static void map_ipv4_to_ipv6(struct in6_addr *ipv6, __be32 ipv4) { __builtin_memset(&ipv6->in6_u.u6_addr8[0], 0x00, 10); __builtin_memset(&ipv6->in6_u.u6_addr8[10], 0xff, 2); ipv6->in6_u.u6_addr32[3] = ipv4; } +/* + * Returns the number of unparsed bytes left in the packet (bytes after nh.pos) + */ +static __u32 remaining_pkt_payload(struct parsing_context *ctx) +{ + // pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer) + // data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and
unbounded register + __u32 parsed_bytes = ctx->nh.pos - ctx->data; + return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0; +} + /* * Parses the TSval and TSecr values from the TCP options field. If sucessful * the TSval and TSecr values will be stored at tsval and tsecr (in network @@ -131,209 +179,415 @@ static int parse_tcp_ts(struct tcphdr *tcph, void *data_end, __u32 *tsval, /* * Attempts to fetch an identifier for TCP packets, based on the TCP timestamp * option. - * If successful, identifier will be set to TSval if is_ingress, or TSecr - * otherwise, the port-members of saddr and daddr will be set to the TCP source - * and dest, respectively, fei will be filled appropriately (based on - * SYN/FIN/RST) and 0 will be returned. - * On failure, -1 will be returned. + * + * Will use the TSval as pid and TSecr as reply_pid, and the TCP source and dest + * as port numbers. + * + * If successful, the pid (identifier + flow.port), reply_pid, pid_valid, + * reply_pid_valid, event_type and event_reason members of p_info will be set + * appropriately and 0 will be returned. + * On failure -1 will be returned (no guarantees on values set in p_info).
*/ -static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport, - __be16 *dport, struct flow_event_info *fei, - __u32 *identifier) +static int parse_tcp_identifier(struct parsing_context *pctx, + struct packet_info *p_info) { - __u32 tsval, tsecr; - struct tcphdr *tcph; + if (parse_tcphdr(&pctx->nh, pctx->data_end, &p_info->tcph) < 0) + return -1; + + if (parse_tcp_ts(p_info->tcph, pctx->data_end, &p_info->pid.identifier, + &p_info->reply_pid.identifier) < 0) + return -1; //Possible TODO, fall back on seq/ack instead + + p_info->pid.flow.saddr.port = p_info->tcph->source; + p_info->pid.flow.daddr.port = p_info->tcph->dest; + + // Do not timestamp pure ACKs (no payload) + p_info->pid_valid = + pctx->nh.pos - pctx->data < pctx->pkt_len || p_info->tcph->syn; + + // Do not match on non-ACKs (TSecr not valid) + p_info->reply_pid_valid = p_info->tcph->ack; - if (parse_tcphdr(&ctx->nh, ctx->data_end, &tcph) < 0) + // Check if connection is opening/closing + if (p_info->tcph->rst) { + p_info->event_type = FLOW_EVENT_CLOSING_BOTH; + p_info->event_reason = EVENT_REASON_RST; + } else if (p_info->tcph->fin) { + p_info->event_type = FLOW_EVENT_CLOSING; + p_info->event_reason = EVENT_REASON_FIN; + } else if (p_info->tcph->syn) { + p_info->event_type = FLOW_EVENT_OPENING; + p_info->event_reason = p_info->tcph->ack ? + EVENT_REASON_SYN_ACK : + EVENT_REASON_SYN; + } else { + p_info->event_type = FLOW_EVENT_NONE; + } + + return 0; +} + +/* + * Attempts to fetch an identifier for an ICMPv6 header, based on the echo + * request/reply sequence number. + * + * Will use the echo sequence number as pid/reply_pid and the echo identifier + * as port numbers. Echo requests will only generate a valid pid and echo + * replies will only generate a valid reply_pid. + * + * If successful, the pid (identifier + flow.port), reply_pid, pid_valid, + * reply_pid_valid and event_type of p_info will be set appropriately and 0 + * will be returned.
+ * On failure, -1 will be returned (no guarantees on p_info members). + * + * Note: Will store the 16-bit sequence number in network byte order + * in the 32-bit (reply_)pid.identifier. + */ +static int parse_icmp6_identifier(struct parsing_context *pctx, + struct packet_info *p_info) +{ + if (parse_icmp6hdr(&pctx->nh, pctx->data_end, &p_info->icmp6h) < 0) return -1; - // Do not timestamp pure ACKs - if (ctx->is_egress && ctx->nh.pos - ctx->data >= ctx->pkt_len && - !tcph->syn) + if (p_info->icmp6h->icmp6_code != 0) return -1; - // Check if connection is opening/closing - if (tcph->syn) { - fei->event = FLOW_EVENT_OPENING; - fei->reason = - tcph->ack ? EVENT_REASON_SYN_ACK : EVENT_REASON_SYN; - } else if (tcph->rst) { - fei->event = FLOW_EVENT_CLOSING; - fei->reason = EVENT_REASON_RST; - } else if (!ctx->is_egress && tcph->fin) { - fei->event = FLOW_EVENT_CLOSING; - fei->reason = - tcph->ack ? EVENT_REASON_FIN_ACK : EVENT_REASON_FIN; + if (p_info->icmp6h->icmp6_type == ICMPV6_ECHO_REQUEST) { + p_info->pid.identifier = p_info->icmp6h->icmp6_sequence; + p_info->pid_valid = true; + p_info->reply_pid_valid = false; + } else if (p_info->icmp6h->icmp6_type == ICMPV6_ECHO_REPLY) { + p_info->reply_pid.identifier = p_info->icmp6h->icmp6_sequence; + p_info->reply_pid_valid = true; + p_info->pid_valid = false; } else { - fei->event = FLOW_EVENT_NONE; + return -1; } - if (parse_tcp_ts(tcph, ctx->data_end, &tsval, &tsecr) < 0) - return -1; //Possible TODO, fall back on seq/ack instead + p_info->event_type = FLOW_EVENT_NONE; + p_info->pid.flow.saddr.port = p_info->icmp6h->icmp6_identifier; + p_info->pid.flow.daddr.port = p_info->pid.flow.saddr.port; + return 0; +} + +/* + * Same as parse_icmp6_identifier, but for an ICMP(v4) header instead. 
+ */ +static int parse_icmp_identifier(struct parsing_context *pctx, + struct packet_info *p_info) +{ + if (parse_icmphdr(&pctx->nh, pctx->data_end, &p_info->icmph) < 0) + return -1; + + if (p_info->icmph->code != 0) + return -1; - *sport = tcph->source; - *dport = tcph->dest; - *identifier = ctx->is_egress ? tsval : tsecr; + if (p_info->icmph->type == ICMP_ECHO) { + p_info->pid.identifier = p_info->icmph->un.echo.sequence; + p_info->pid_valid = true; + p_info->reply_pid_valid = false; + } else if (p_info->icmph->type == ICMP_ECHOREPLY) { + p_info->reply_pid.identifier = p_info->icmph->un.echo.sequence; + p_info->reply_pid_valid = true; + p_info->pid_valid = false; + } else { + return -1; + } + + p_info->event_type = FLOW_EVENT_NONE; + p_info->pid.flow.saddr.port = p_info->icmph->un.echo.id; + p_info->pid.flow.daddr.port = p_info->pid.flow.saddr.port; return 0; } /* - * Attempts to parse the packet limited by the data and data_end pointers, - * to retrieve a protocol dependent packet identifier. If sucessful, the - * pointed to p_id and fei will be filled with parsed information from the - * packet, and 0 will be returned. On failure, -1 will be returned. - * If is_egress saddr and daddr will match source and destination of packet, - * respectively, and identifier will be set to the identifer for an outgoing - * packet. Otherwise, saddr and daddr will be swapped (will match - * destination and source of packet, respectively), and identifier will be - * set to the identifier of a response. + * Attempts to parse the packet defined by pctx for a valid packet identifier + * and reply identifier, filling in p_info. + * + * If successful, all members of p_info will be set appropriately and 0 will + * be returned. + * On failure -1 will be returned (no guarantees on p_info members).
*/ -static int parse_packet_identifier(struct parsing_context *ctx, - struct packet_id *p_id, - struct flow_event_info *fei) +static int parse_packet_identifier(struct parsing_context *pctx, + struct packet_info *p_info) { int proto, err; struct ethhdr *eth; - struct iphdr *iph; - struct ipv6hdr *ip6h; - struct flow_address *saddr, *daddr; - - // Switch saddr <--> daddr on ingress to match egress - if (ctx->is_egress) { - saddr = &p_id->flow.saddr; - daddr = &p_id->flow.daddr; - } else { - saddr = &p_id->flow.daddr; - daddr = &p_id->flow.saddr; - } - proto = parse_ethhdr(&ctx->nh, ctx->data_end, &eth); + p_info->time = bpf_ktime_get_ns(); + proto = parse_ethhdr(&pctx->nh, pctx->data_end, &eth); // Parse IPv4/6 header if (proto == bpf_htons(ETH_P_IP)) { - p_id->flow.ipv = AF_INET; - p_id->flow.proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph); + p_info->pid.flow.ipv = AF_INET; + p_info->pid.flow.proto = + parse_iphdr(&pctx->nh, pctx->data_end, &p_info->iph); } else if (proto == bpf_htons(ETH_P_IPV6)) { - p_id->flow.ipv = AF_INET6; - p_id->flow.proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h); + p_info->pid.flow.ipv = AF_INET6; + p_info->pid.flow.proto = + parse_ip6hdr(&pctx->nh, pctx->data_end, &p_info->ip6h); } else { return -1; } - // Add new protocols here - if (p_id->flow.proto == IPPROTO_TCP) { - err = parse_tcp_identifier(ctx, &saddr->port, &daddr->port, - fei, &p_id->identifier); - if (err) - return -1; - } else { - return -1; - } + // Parse identifier from suitable protocol + if (p_info->pid.flow.proto == IPPROTO_TCP) + err = parse_tcp_identifier(pctx, p_info); + else if (p_info->pid.flow.proto == IPPROTO_ICMPV6 && + p_info->pid.flow.ipv == AF_INET6) + err = parse_icmp6_identifier(pctx, p_info); + else if (p_info->pid.flow.proto == IPPROTO_ICMP && + p_info->pid.flow.ipv == AF_INET) + err = parse_icmp_identifier(pctx, p_info); + else + return -1; // No matching protocol + if (err) + return -1; // Failed parsing protocol // Sucessfully parsed packet identifier -
fill in IP-addresses and return - if (p_id->flow.ipv == AF_INET) { - map_ipv4_to_ipv6(iph->saddr, &saddr->ip); - map_ipv4_to_ipv6(iph->daddr, &daddr->ip); + if (p_info->pid.flow.ipv == AF_INET) { + map_ipv4_to_ipv6(&p_info->pid.flow.saddr.ip, + p_info->iph->saddr); + map_ipv4_to_ipv6(&p_info->pid.flow.daddr.ip, + p_info->iph->daddr); } else { // IPv6 - saddr->ip = ip6h->saddr; - daddr->ip = ip6h->daddr; + p_info->pid.flow.saddr.ip = p_info->ip6h->saddr; + p_info->pid.flow.daddr.ip = p_info->ip6h->daddr; } + + reverse_flow(&p_info->reply_pid.flow, &p_info->pid.flow); + p_info->payload = remaining_pkt_payload(pctx); + return 0; } /* - * Returns the number of unparsed bytes left in the packet (bytes after nh.pos) + * Calculate a smoothed rtt similar to how TCP stack does it in + * net/ipv4/tcp_input.c/tcp_rtt_estimator(). + * + * NOTE: Will cause roundoff errors, but if RTTs > 1000ns errors should be small */ -static __u32 remaining_pkt_payload(struct parsing_context *ctx) +static __u64 calculate_srtt(__u64 prev_srtt, __u64 rtt) { - // pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer) - // data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register - __u32 parsed_bytes = ctx->nh.pos - ctx->data; - return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0; + if (!prev_srtt) + return rtt; + // srtt = 7/8*prev_srtt + 1/8*rtt + return prev_srtt - (prev_srtt >> 3) + (rtt >> 3); +} + +/* + * Return true if flow should wait with timestamping due to rate limit + */ +static bool is_rate_limited(__u64 now, __u64 last_ts, __u64 rtt) +{ + if (now < last_ts) + return true; + + // RTT-based rate limit + if (config.rtt_rate && rtt) + return now - last_ts < FIXPOINT_TO_UINT(config.rtt_rate * rtt); + + // Static rate limit + return now - last_ts < config.rate_limit; } /* - * Fills in event_type, timestamp, flow, source and reserved. 
- * Does not fill in the flow_info. + * Fills in the members of flow_event based on the passed values. + * + * If rev_flow is true, will report src/dest as in p_info.reply_pid.flow and set + * reason as packet-dest. Otherwise, will report src/dest as in p_info.pid.flow + * and reason as packet-src. */ -static void fill_flow_event(struct flow_event *fe, __u64 timestamp, - struct network_tuple *flow, - enum flow_event_source source) +static void fill_flow_event(struct flow_event *fe, struct packet_info *p_info, + bool rev_flow) { fe->event_type = EVENT_TYPE_FLOW; - fe->timestamp = timestamp; - __builtin_memcpy(&fe->flow, flow, sizeof(struct network_tuple)); - fe->source = source; + fe->flow_event_type = p_info->event_type; + fe->reason = p_info->event_reason; + fe->timestamp = p_info->time; fe->reserved = 0; // Make sure it's initialized -} -// Programs + if (rev_flow) { + fe->flow = p_info->reply_pid.flow; + fe->source = EVENT_SOURCE_PKT_DEST; + } else { + fe->flow = p_info->pid.flow; + fe->source = EVENT_SOURCE_PKT_SRC; + } +} -// TC-BFP for parsing packet identifier from egress traffic and add to map -SEC(EGRESS_PROG_SEC) -int pping_egress(struct __sk_buff *skb) +/* + * Attempt to create a new flow-state. + * Returns a pointer to the flow_state if successful, NULL otherwise + */ +static struct flow_state *create_flow(struct packet_info *p_info) { - struct packet_id p_id = { 0 }; - struct flow_event fe; - __u64 now; - struct parsing_context pctx = { - .data = (void *)(long)skb->data, - .data_end = (void *)(long)skb->data_end, - .pkt_len = skb->len, - .nh = { .pos = pctx.data }, - .is_egress = true, - }; - struct flow_state *f_state; struct flow_state new_state = { 0 }; + new_state.last_timestamp = p_info->time; + new_state.opening_reason = p_info->event_type == FLOW_EVENT_OPENING ? 
+ p_info->event_reason : + EVENT_REASON_FIRST_OBS_PCKT; - if (parse_packet_identifier(&pctx, &p_id, &fe.event_info) < 0) - goto out; + if (bpf_map_update_elem(&flow_state, &p_info->pid.flow, &new_state, + BPF_NOEXIST) != 0) + return NULL; - now = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns - f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow); + return bpf_map_lookup_elem(&flow_state, &p_info->pid.flow); +} - // Flow closing - try to delete flow state and push closing-event - if (fe.event_info.event == FLOW_EVENT_CLOSING) { - if (!f_state) { - bpf_map_delete_elem(&flow_state, &p_id.flow); - fill_flow_event(&fe, now, &p_id.flow, - EVENT_SOURCE_EGRESS); - bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, - &fe, sizeof(fe)); +static struct flow_state *update_flow(void *ctx, struct packet_info *p_info, + struct flow_event *fe, bool *new_flow) +{ + struct flow_state *f_state; + bool has_opened; + *new_flow = false; + + f_state = bpf_map_lookup_elem(&flow_state, &p_info->pid.flow); + + // Flow is closing - attempt to delete state if it exists + if (p_info->event_type == FLOW_EVENT_CLOSING || + p_info->event_type == FLOW_EVENT_CLOSING_BOTH) { + if (!f_state) + return NULL; + + has_opened = f_state->has_opened; + if (bpf_map_delete_elem(&flow_state, &p_info->pid.flow) == 0 && + has_opened) { + fill_flow_event(fe, p_info, true); + bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, + fe, sizeof(*fe)); } - goto out; + return NULL; } - // No previous state - attempt to create it and push flow-opening event - if (!f_state) { - bpf_map_update_elem(&flow_state, &p_id.flow, &new_state, - BPF_NOEXIST); - f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow); + // Attempt to create flow if it does not exist + if (!f_state && p_info->pid_valid) { + *new_flow = true; + f_state = create_flow(p_info); + } + + if (!f_state) + return NULL; + + // Update flow state + f_state->sent_pkts++; + f_state->sent_bytes += p_info->payload; + + return f_state; +} + +static struct 
flow_state *update_rev_flow(void *ctx, struct packet_info *p_info, + struct flow_event *fe) +{ + struct flow_state *f_state; + bool has_opened; + + f_state = bpf_map_lookup_elem(&flow_state, &p_info->reply_pid.flow); - if (!f_state) // Creation failed - goto out; + if (!f_state) + return NULL; + + // Close reverse flow + if (p_info->event_type == FLOW_EVENT_CLOSING_BOTH) { - if (fe.event_info.event != FLOW_EVENT_OPENING) { - fe.event_info.event = FLOW_EVENT_OPENING; - fe.event_info.reason = EVENT_REASON_FIRST_OBS_PCKT; + has_opened = f_state->has_opened; + if (bpf_map_delete_elem(&flow_state, &p_info->reply_pid.flow) == + 0 && has_opened) { + fill_flow_event(fe, p_info, false); + bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, + fe, sizeof(*fe)); } - fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_EGRESS); - bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &fe, - sizeof(fe)); + return NULL; } - f_state->sent_pkts++; - f_state->sent_bytes += remaining_pkt_payload(&pctx); + // Is a new flow, push opening flow message + if (!f_state->has_opened) { + f_state->has_opened = true; + + fe->event_type = EVENT_TYPE_FLOW; + fe->flow = p_info->pid.flow; + fe->timestamp = f_state->last_timestamp; + fe->flow_event_type = FLOW_EVENT_OPENING; + fe->reason = f_state->opening_reason; + fe->source = EVENT_SOURCE_PKT_DEST; + fe->reserved = 0; + bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, fe, + sizeof(*fe)); + } + + // Update flow state + f_state->rec_pkts++; + f_state->rec_bytes += p_info->payload; + + return f_state; +} + +/* + * Return true if p_info->pid.flow.daddr is a "local" address. + * + * Works by performing a fib lookup for p_info->pid.flow. + * Code heavily inspired by samples/bpf/xdp_fwd_kern.c/xdp_fwd_flags(). 
+ */ +static bool is_local_address(struct packet_info *p_info, void *ctx, + struct parsing_context *pctx) +{ + int err; + struct bpf_fib_lookup lookup = { 0 }; + __builtin_memset(&lookup, 0, sizeof(lookup)); + + if (p_info->pid.flow.ipv == AF_INET) { + lookup.family = AF_INET; + lookup.tos = p_info->iph->tos; + lookup.tot_len = bpf_ntohs(p_info->iph->tot_len); + lookup.ipv4_src = p_info->iph->saddr; + lookup.ipv4_dst = p_info->iph->daddr; + } else { + struct in6_addr *src = (struct in6_addr *)lookup.ipv6_src; + struct in6_addr *dst = (struct in6_addr *)lookup.ipv6_dst; + + lookup.family = AF_INET6; + lookup.flowinfo = *(__be32 *)p_info->ip6h & IPV6_FLOWINFO_MASK; + lookup.tot_len = bpf_ntohs(p_info->ip6h->payload_len); + *src = p_info->pid.flow.saddr.ip; //verifier did not like ip6h->saddr + *dst = p_info->pid.flow.daddr.ip; + } + + lookup.family = p_info->pid.flow.ipv; + lookup.l4_protocol = p_info->pid.flow.proto; + lookup.sport = 0; + lookup.dport = 0; + lookup.ifindex = pctx->ifindex; + + err = bpf_fib_lookup(ctx, &lookup, sizeof(lookup), 0); + + return err == BPF_FIB_LKUP_RET_NOT_FWDED || + err == BPF_FIB_LKUP_RET_FWD_DISABLED; +} + +/* + * Attempt to create a timestamp-entry for packet p_info for flow in f_state + */ +static void pping_timestamp_packet(struct flow_state *f_state, void *ctx, + struct parsing_context *pctx, + struct packet_info *p_info, bool new_flow) +{ + if (!f_state || !p_info->pid_valid) + return; + + if (config.localfilt && !pctx->is_egress && + is_local_address(p_info, ctx, pctx)) + return; // Check if identifier is new - if (f_state->last_id == p_id.identifier) - goto out; - f_state->last_id = p_id.identifier; + if (!new_flow && f_state->last_id == p_info->pid.identifier) + return; + f_state->last_id = p_info->pid.identifier; // Check rate-limit - if (now < f_state->last_timestamp || - now - f_state->last_timestamp < config.rate_limit) - goto out; + if (!new_flow && + is_rate_limited(p_info->time, f_state->last_timestamp, + 
config.use_srtt ? f_state->srtt : f_state->min_rtt)) + return; /* * Updates attempt at creating timestamp, even if creation of timestamp @@ -341,75 +595,123 @@ int pping_egress(struct __sk_buff *skb) * the next available map slot somewhat fairer between heavy and sparse * flows. */ - f_state->last_timestamp = now; - bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST); + f_state->last_timestamp = p_info->time; -out: - return BPF_OK; + bpf_map_update_elem(&packet_ts, &p_info->pid, &p_info->time, + BPF_NOEXIST); } -// XDP program for parsing identifier in ingress traffic and check for match in map -SEC(INGRESS_PROG_SEC) -int pping_ingress(struct xdp_md *ctx) +/* + * Attempt to match packet in p_info with a timestamp from flow in f_state + */ +static void pping_match_packet(struct flow_state *f_state, void *ctx, + struct parsing_context *pctx, + struct packet_info *p_info) { - struct packet_id p_id = { 0 }; - __u64 *p_ts; - struct flow_event fe; struct rtt_event re = { 0 }; - struct flow_state *f_state; - struct parsing_context pctx = { - .data = (void *)(long)ctx->data, - .data_end = (void *)(long)ctx->data_end, - .pkt_len = pctx.data_end - pctx.data, - .nh = { .pos = pctx.data }, - .is_egress = false, - }; - __u64 now; - - if (parse_packet_identifier(&pctx, &p_id, &fe.event_info) < 0) - goto out; - - f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow); - if (!f_state) - goto out; - - f_state->rec_pkts++; - f_state->rec_bytes += remaining_pkt_payload(&pctx); + __u64 *p_ts; - now = bpf_ktime_get_ns(); - p_ts = bpf_map_lookup_elem(&packet_ts, &p_id); - if (!p_ts || now < *p_ts) - goto validflow_out; + if (!f_state || !p_info->reply_pid_valid) + return; - re.rtt = now - *p_ts; + p_ts = bpf_map_lookup_elem(&packet_ts, &p_info->reply_pid); + if (!p_ts || p_info->time < *p_ts) + return; + re.rtt = p_info->time - *p_ts; // Delete timestamp entry as soon as RTT is calculated - bpf_map_delete_elem(&packet_ts, &p_id); + bpf_map_delete_elem(&packet_ts, 
&p_info->reply_pid); if (f_state->min_rtt == 0 || re.rtt < f_state->min_rtt) f_state->min_rtt = re.rtt; + f_state->srtt = calculate_srtt(f_state->srtt, re.rtt); + // Fill event and push to perf-buffer re.event_type = EVENT_TYPE_RTT; - re.timestamp = now; + re.timestamp = p_info->time; re.min_rtt = f_state->min_rtt; re.sent_pkts = f_state->sent_pkts; re.sent_bytes = f_state->sent_bytes; re.rec_pkts = f_state->rec_pkts; re.rec_bytes = f_state->rec_bytes; - - // Push event to perf-buffer - __builtin_memcpy(&re.flow, &p_id.flow, sizeof(struct network_tuple)); + re.flow = p_info->pid.flow; + re.match_on_egress = pctx->is_egress; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &re, sizeof(re)); +} -validflow_out: - // Wait with deleting flow until having pushed final RTT message - if (fe.event_info.event == FLOW_EVENT_CLOSING && f_state) { - bpf_map_delete_elem(&flow_state, &p_id.flow); - fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_INGRESS); - bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe, - sizeof(fe)); - } +/* + * Will parse the ingress/egress packet in pctx and attempt to create a + * timestamp for it and match it against the reverse flow. 
+ */ +static void pping(void *ctx, struct parsing_context *pctx) +{ + struct packet_info p_info = { 0 }; + struct flow_state *f_state; + struct flow_event fe; + bool new_flow; + + if (parse_packet_identifier(pctx, &p_info) < 0) + return; + + f_state = update_flow(ctx, &p_info, &fe, &new_flow); + pping_timestamp_packet(f_state, ctx, pctx, &p_info, new_flow); + + f_state = update_rev_flow(ctx, &p_info, &fe); + pping_match_packet(f_state, ctx, pctx, &p_info); +} + +// Programs + +// Egress path using TC-BPF +SEC(SEC_EGRESS_TC) +int pping_tc_egress(struct __sk_buff *skb) +{ + struct parsing_context pctx = { + .data = (void *)(long)skb->data, + .data_end = (void *)(long)skb->data_end, + .pkt_len = skb->len, + .nh = { .pos = pctx.data }, + .ifindex = skb->ingress_ifindex, + .is_egress = true, + }; + + pping(skb, &pctx); + + return TC_ACT_UNSPEC; +} + +// Ingress path using TC-BPF +SEC(SEC_INGRESS_TC) +int pping_tc_ingress(struct __sk_buff *skb) +{ + struct parsing_context pctx = { + .data = (void *)(long)skb->data, + .data_end = (void *)(long)skb->data_end, + .pkt_len = skb->len, + .nh = { .pos = pctx.data }, + .ifindex = skb->ingress_ifindex, + .is_egress = false, + }; + + pping(skb, &pctx); + + return TC_ACT_UNSPEC; +} + +// Ingress path using XDP +SEC(SEC_INGRESS_XDP) +int pping_xdp_ingress(struct xdp_md *ctx) +{ + struct parsing_context pctx = { + .data = (void *)(long)ctx->data, + .data_end = (void *)(long)ctx->data_end, + .pkt_len = pctx.data_end - pctx.data, + .nh = { .pos = pctx.data }, + .ifindex = ctx->ingress_ifindex, + .is_egress = false, + }; + + pping(ctx, &pctx); -out: return XDP_PASS; }