Demonstrates extracting the relevant flow data from a packet trace into an HDF5 store using `libtrace` and `pandas`.

##### Getting a trace for testing
1. Download a trace from the [MAWI Traffic Archive](http://mawi.wide.ad.jp/mawi/).
2. Truncate it to a workable duration (e.g. 1 minute) with [tracesplit](https://github.com/LibtraceTeam/libtrace/wiki/tracesplit): `tracesplit -i 60 -m 1 inputuri outputuri`

In [None]:
TRACE_PATH = '/tmp/anon-v4.pcap'
STORE_PATH = '/tmp/anon-v4.hdf5'

In [None]:
import os
import sys

import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
# python-libtrace's module is named `plt`; alias it to keep it distinct from matplotlib's `plt`
import plt as libtrace

from typing import Tuple

from utils.managed_trace import managed_trace
from utils.tcp_packet import read_tcp_packet, TCPPacket

We read the trace, keep only the TCP packets, and store the relevant fields of each packet in a list of namedtuples.

In [None]:
filter_ = libtrace.filter('tcp')
with managed_trace(TRACE_PATH) as trace:
    trace.conf_filter(filter_)
    tcp_packets = [read_tcp_packet(packet) for packet in trace]
tcp_df = pd.DataFrame.from_records(tcp_packets, columns=TCPPacket._fields)
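
To illustrate the namedtuple-to-dataframe step without a trace file, here is a minimal sketch. `ExamplePacket` and its field list are hypothetical stand-ins for `TCPPacket` from `utils.tcp_packet`, whose real fields may differ; the values are made up.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for utils.tcp_packet.TCPPacket; the real field list may differ.
ExamplePacket = namedtuple(
    "ExamplePacket", ["src_ip", "src_port", "dst_ip", "dst_port", "flow_hash"]
)

# Made-up packets: two directions of one flow, plus one packet of another flow.
example_packets = [
    ExamplePacket("10.0.0.1", 443, "10.0.0.2", 51000, 1),
    ExamplePacket("10.0.0.2", 51000, "10.0.0.1", 443, 1),
    ExamplePacket("10.0.0.3", 80, "10.0.0.4", 52000, 2),
]

# Passing _fields as columns names each dataframe column after a namedtuple field.
example_df = pd.DataFrame.from_records(example_packets, columns=ExamplePacket._fields)
```

Each namedtuple becomes one row, so the dataframe has one row per packet and one column per field.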

Now we have a `dataframe` with the following columns:

In [None]:
list(tcp_df)

We store each flow's endpoints (`src_ip:src_port` and `dst_ip:dst_port`) in a separate map

    flow_hash -> ((ip_a, port_a), (ip_b, port_b))
    
such that `ip_a < ip_b`. 

In [None]:
def sorted_tcp_tuple(packet) -> Tuple[Tuple[str, int], Tuple[str, int]]:
    src, dst = (str(packet.src_ip), packet.src_port), (str(packet.dst_ip), packet.dst_port)
    return (src, dst) if packet.src_ip < packet.dst_ip else (dst, src)

flow_to_tcp_tuple = {p.flow_hash: sorted_tcp_tuple(p) for p in tcp_df.itertuples()}
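
The ordering can be checked in isolation with the standard-library `ipaddress` module. This sketch assumes the IP fields are address objects that compare numerically (the `str()` calls above suggest this, but it is an assumption); `canonical_endpoints` is a hypothetical helper that mirrors `sorted_tcp_tuple`.

```python
from ipaddress import ip_address


def canonical_endpoints(src_ip, src_port, dst_ip, dst_port):
    """Order the two endpoints so that the numerically smaller IP comes first."""
    src = (str(src_ip), src_port)
    dst = (str(dst_ip), dst_port)
    return (src, dst) if src_ip < dst_ip else (dst, src)


a, b = ip_address("10.0.0.2"), ip_address("10.0.0.10")

# Both directions of the same connection map to the same tuple.
forward = canonical_endpoints(a, 443, b, 51000)
reverse = canonical_endpoints(b, 51000, a, 443)

# Comparing the string forms instead would order the addresses differently.
string_order_differs = str(b) < str(a)
```

The comparison is done on the address objects rather than on their string forms because lexicographic order would place `10.0.0.10` before `10.0.0.2`.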

We remove these columns from the dataframe and instead store the direction of each packet as a boolean, `src_ip < dst_ip`.

In [None]:
tcp_df['ip_direction_asc'] = (tcp_df.src_ip < tcp_df.dst_ip).astype(bool)
del tcp_df['src_ip']
del tcp_df['src_port']
del tcp_df['dst_ip']
del tcp_df['dst_port']
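
No endpoint information is lost by dropping the four columns: a packet's source and destination can be recovered from the flow's sorted tuple and the direction flag. A minimal sketch, where `restore_endpoints` is a hypothetical helper and the tuple values are made up:

```python
def restore_endpoints(flow_tuple, ip_direction_asc):
    """Return (src, dst) for a packet from its flow's sorted tuple and direction flag."""
    first, second = flow_tuple
    # ip_direction_asc is True when the packet travelled from the smaller IP to the larger one.
    return (first, second) if ip_direction_asc else (second, first)


# Made-up sorted tuple, as it would be stored under a flow_hash key.
flow_tuple = (("10.0.0.2", 443), ("10.0.0.10", 51000))

ascending = restore_endpoints(flow_tuple, True)
descending = restore_endpoints(flow_tuple, False)
```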

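Finally, we write the dataframe to the HDF5 store and attach the flow map as an attribute of the stored node.

In [None]: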
with pd.HDFStore(STORE_PATH) as store:
    store['tcp_df'] = tcp_df
    store.get_storer('tcp_df').attrs.flow_to_tcp_tuple = flow_to_tcp_tuple

## Stats

In [None]:
# Histogram of the number of packets per flow
tcp_df['flow_hash'].value_counts().hist(bins=100)

In [None]:
# Summary statistics (count, mean, std, quartiles) of the number of packets per flow
tcp_df['flow_hash'].value_counts().describe()
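
As a small illustration of what these two cells compute, the sketch below applies the same calls to a made-up `flow_hash` column; the values are invented for the example.

```python
import pandas as pd

# Made-up flow hashes: flow 1 has three packets, flow 2 has two, flow 3 has one.
example_hashes = pd.Series([1, 1, 1, 2, 2, 3], name="flow_hash")

# value_counts() gives one entry per flow: the number of packets in that flow.
packets_per_flow = example_hashes.value_counts()

# describe() then summarises the distribution of those per-flow counts.
summary = packets_per_flow.describe()
```

Here `summary["count"]` is the number of distinct flows and `summary["mean"]` is the average number of packets per flow.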

**TODO** 

- Histogram with average number of packets sent per flow
- Packets along time axis
- Data cleaning, connection splitting?
- Do we need payload_length?