# Aggregate data

By default,
[nix-bitcoin-monitor](https://github.com/virtu/nix-bitcoin-monitor) samples
tracepoint and systemd IP accounting data every 5s. This produces a large number
of data points, which makes analysis slow. For traffic analysis, hourly or daily
granularity is sufficient, so this notebook aggregates the data to those
granularities to make analysis more responsive.

# Extract

In [18]:
import pandas as pd

opts = dict(compression="bz2", index_col=0, parse_dates=True)
df_tp = pd.read_csv("tracepoints_preprocessed.csv.bz2", **opts)
df_sys = pd.read_csv("systemd_preprocessed.csv.bz2", **opts)

# Transform

TCP/IP traffic is estimated using the following assumptions:
- MTU size is 1500 bytes (common default)
- Bitcoin protocol overhead is 24 bytes (4-byte magic, 12-byte command, 4-byte
  each for payload length and checksum)
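- TCP header size is 32 bytes (the 20-byte minimum header plus 12 bytes of
  options, e.g. TCP timestamps)
- IPv4 and IPv6 header sizes are 20 and 40 bytes (defaults, without options or
  extension headers)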

The estimate is computed as follows. First, the application-level message size
is computed by adding the Bitcoin P2P message overhead to the message size.
Next, the number of TCP segments is computed by dividing the application-level
size by the maximum segment size (the MTU minus the TCP and IP headers) and
rounding up. Then, the total TCP/IP overhead is computed as the number of
segments times the combined TCP and IP header size. Finally, TCP/IP traffic is
estimated as the sum of the application-level message size and the total TCP/IP
overhead.
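
As a worked example of the steps above, the following sketch applies the arithmetic to two made-up message sizes (8 and 2000 bytes). The function name `estimate_tcpip_bytes` and the sample sizes are illustrative assumptions, not part of the dataset.

```python
import math

MTU = 1500           # assumed MTU in bytes
BITCOIN_HEADER = 24  # magic, command, payload length, checksum
TCP_HEADER = 32      # assumed TCP header size including options


def estimate_tcpip_bytes(message_size, ipv6=False):
    """Hypothetical helper: estimated TCP/IP bytes for one message."""
    ip_header = 40 if ipv6 else 20
    mss = MTU - ip_header - TCP_HEADER          # bytes of data per segment
    app_size = message_size + BITCOIN_HEADER    # application-level size
    segments = math.ceil(app_size / mss)        # round up to whole segments
    return app_size + segments * (ip_header + TCP_HEADER)


# 8 + 24 = 32 bytes fit into one segment: 32 + 1 * 52 = 84
small_ipv4 = estimate_tcpip_bytes(8)
# 2000 + 24 = 2024 bytes need two 1428-byte segments: 2024 + 2 * 72 = 2168
large_ipv6 = estimate_tcpip_bytes(2000, ipv6=True)
print(small_ipv4, large_ipv6)
```

Note that the per-segment overhead dominates for small messages: the 8-byte message more than doubles in size once the Bitcoin header is added and again once TCP/IP headers are included.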

Next, empirical TCP/IP measurements obtained via systemd accounting are combined
with the estimate so the latter can be validated.
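
One way such a validation could look is sketched below: estimated and empirical daily totals are joined on day and flow direction, and the ratio of the two is computed. The numbers are made up, and the sketch assumes both dataframes use the same `flow` labels (`in` and `out`); neither is confirmed by the data used in this notebook.

```python
import pandas as pd

days = pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"])
flows = ["in", "out", "in", "out"]

# Made-up daily totals in bytes (estimate already summed over message types)
est = pd.DataFrame({"freq": days, "flow": flows, "net_size": [900, 450, 1000, 500]})
emp = pd.DataFrame({"freq": days, "flow": flows, "net_size": [1000, 500, 1000, 625]})

# Align estimate and measurement per day and direction
merged = est.merge(emp, on=["freq", "flow"], suffixes=("_est", "_emp"))

# A ratio below 1 means the estimate accounts for less traffic than was measured
merged["ratio"] = merged["net_size_est"] / merged["net_size_emp"]
print(merged)
```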

### TCP/IP estimate

In [None]:
import math
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)


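def estimate_network_traffic(row):
    """Estimate the TCP/IP bytes needed to transfer one Bitcoin P2P message."""
    MAX_MTU_SIZE = 1500
    BITCOIN_PROTOCOL_OVERHEAD = 24  # magic, command, payload length, checksum
    TCP_HEADER_SIZE = 32
    IP_HEADER_SIZE = 40 if row["ipv6"] else 20
    # Maximum segment size: application bytes that fit into one TCP segment
    MSS = MAX_MTU_SIZE - IP_HEADER_SIZE - TCP_HEADER_SIZE
    bitcoin_message_size = row["size"] + BITCOIN_PROTOCOL_OVERHEAD
    num_segments = math.ceil(bitcoin_message_size / MSS)
    # Each segment carries one IP header and one TCP header
    tcpip_overhead = num_segments * (IP_HEADER_SIZE + TCP_HEADER_SIZE)
    return bitcoin_message_size + tcpip_overhead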


df_tp["net_size"] = df_tp.parallel_apply(estimate_network_traffic, axis=1)
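
If the row-wise apply turns out to be slow, the same arithmetic can be expressed with vectorized NumPy operations. The following is a minimal sketch, assuming `ipv6` is a boolean column and `size` an integer column; the function name `estimate_network_traffic_vectorized` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd


def estimate_network_traffic_vectorized(df):
    """Hypothetical vectorized variant of the per-row estimate."""
    ip_header = np.where(df["ipv6"], 40, 20)   # per-row IP header size
    mss = 1500 - ip_header - 32                # MTU minus IP and TCP headers
    message_size = df["size"] + 24             # add Bitcoin P2P header
    segments = np.ceil(message_size / mss).astype("int64")
    return message_size + segments * (ip_header + 32)


# Toy input with made-up sizes
toy = pd.DataFrame({"size": [8, 2000, 1424], "ipv6": [False, True, False]})
toy["net_size"] = estimate_network_traffic_vectorized(toy)
print(toy)
```

The third toy row is chosen so that the message exactly fills one IPv4 segment, which yields a total of 1500 bytes, the assumed MTU.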

### Combine and aggregate data

First, the dataframe containing empirical data from systemd's IP accounting is
pivoted into long format (one row per timestamp and flow direction) so it can be
aggregated.

Next, the pivoted dataframe and the tracepoint dataframe are aggregated to
produce hourly and daily data.

In [38]:
df_sys_t = (
    df_sys.rename(columns={"IPIngressBytes": "in", "IPEgressBytes": "out"})[
        ["in", "out"]
    ]
    .stack()
    .rename("net_size")
    .reset_index()
    .rename(columns={"level_1": "flow"})
    .set_index("timestamp")
)


def agg_sum(df, cols, freq, data="net_size"):
    """Sum the 'data' column per time bucket of width 'freq' (derived from the
    datetime index), computing a separate sum for each combination of 'cols'."""
    df_tmp = df.copy()
    df_tmp["freq"] = df_tmp.index.floor(freq)
    df_result = (
        df_tmp.groupby(["freq"] + cols)[data]
        .sum()
        .reset_index()
        .set_index("freq")
    )
    return df_result


dfs = {
    "est_hourly": agg_sum(df_tp, ["flow", "msg_type"], "1h"),
    "est_daily": agg_sum(df_tp, ["flow", "msg_type"], "1D"),
    "emp_hourly": agg_sum(df_sys_t, ["flow"], "1h"),
    "emp_daily": agg_sum(df_sys_t, ["flow"], "1D"),
}
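
To illustrate what the pivot and the aggregation do, the following sketch runs both steps on four made-up 5s samples that straddle an hour boundary. The toy values and the names `toy_sys`, `toy_long`, and `hourly` are assumptions for illustration only.

```python
import pandas as pd

# Four made-up 5s samples: two before and two after 01:00:00
idx = pd.date_range("2024-01-01 00:59:50", periods=4, freq="5s", name="timestamp")
toy_sys = pd.DataFrame(
    {"IPIngressBytes": [100, 200, 300, 400], "IPEgressBytes": [10, 20, 30, 40]},
    index=idx,
)

# Wide to long: one row per (timestamp, flow) pair
toy_long = (
    toy_sys.rename(columns={"IPIngressBytes": "in", "IPEgressBytes": "out"})
    .stack()
    .rename("net_size")
    .reset_index()
    .rename(columns={"level_1": "flow"})
    .set_index("timestamp")
)

# Bucket timestamps by hour and sum per bucket and flow direction
toy_long["freq"] = toy_long.index.floor("1h")
hourly = toy_long.groupby(["freq", "flow"])["net_size"].sum()
print(hourly)
```

The result has one row per hour and flow direction: 300 and 30 bytes for the first hour, 700 and 70 bytes for the second.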

# Load

In [41]:
for name, df in dfs.items():
    df.to_csv(f"data_{name}.csv.bz2", compression="bz2")
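
A downstream notebook can read the aggregated files back with the same options used in the Extract step. The sketch below demonstrates the round trip on a made-up dataframe written to a temporary directory; the file name `data_toy.csv.bz2` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up aggregate with the same layout as the saved dataframes
toy = pd.DataFrame(
    {"flow": ["in", "out"], "net_size": [300, 30]},
    index=pd.DatetimeIndex(
        ["2024-01-01 00:00:00", "2024-01-01 01:00:00"], name="freq"
    ),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_toy.csv.bz2"
    toy.to_csv(path, compression="bz2")
    # index_col=0 and parse_dates=True restore the datetime index
    restored = pd.read_csv(path, compression="bz2", index_col=0, parse_dates=True)

print(restored)
```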