# Kent 2016 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
DATA_FLOWS = 'data/kent2016/flows.txt'
DATA_AUTH  = 'data/kent2016/auth.txt'
DATA_RTEAM = 'data/kent2016/redteam.txt'
DATA_PROC  = 'data/kent2016/proc.txt'
DATA_DNS   = 'data/kent2016/dns.txt'

FLOWS_HEADER_NAMES = ['time', 'duration', 'srccomp', 'srcport', 'dstcomp', 'dstport', 'proto', 'pkts', 'bytes']
FLOWS_HEADER_TYPES = {'time': np.uint32, 'duration': np.uint16, 'srccomp': str, 'srcport': str, 'dstcomp': str, 'dstport': str, 'proto': str}

In [3]:
# The file is large, so preview only its first lines
!head $DATA_FLOWS

1,0,C1065,389,C3799,N10451,6,10,5323
1,0,C1423,N1136,C1707,N1,6,5,847
1,0,C1423,N1142,C1707,N1,6,5,847
1,0,C14909,N8191,C5720,2049,6,1,52
1,0,C14909,N8192,C5720,2049,6,1,52
1,0,C14909,N8193,C5720,2049,6,1,52
1,0,C1707,N1,C1423,N1136,6,4,414
1,0,C1707,N1,C1423,N1142,6,4,413
1,0,C1707,N1,C925,N10487,6,4,414
1,0,C1707,N1,C925,N10491,6,4,413


In [4]:
# Count the number of its entries
!wc -l $DATA_FLOWS

129977412 data/kent2016/flows.txt


In [5]:
# Read only a sample, since the whole file does not fit into memory at once
flows = pd.read_csv(DATA_FLOWS, names=FLOWS_HEADER_NAMES, dtype=FLOWS_HEADER_TYPES, nrows=10000000)

In [6]:
# What fraction of the data is loaded?
len(flows) / 129977412

0.07693644492629227

The sample covers less than 8% of all flow records.
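
When a statistic has to cover all records rather than a sample, the file can be streamed in chunks instead of loaded at once. The following is a minimal sketch, not part of the original analysis: `summarize_in_chunks` is a hypothetical helper, and the small in-memory buffer stands in for `data/kent2016/flows.txt`.

```python
import io

import pandas as pd

# Column names mirror FLOWS_HEADER_NAMES from the notebook.
COLUMNS = ['time', 'duration', 'srccomp', 'srcport', 'dstcomp',
           'dstport', 'proto', 'pkts', 'bytes']


def summarize_in_chunks(source, chunksize=1_000_000):
    """Stream a flow file; return (row count, set of source hosts, last timestamp)."""
    n_rows = 0
    src_hosts = set()
    last_time = None
    # With chunksize set, read_csv yields DataFrames one block at a time.
    for chunk in pd.read_csv(source, names=COLUMNS, dtype=str, chunksize=chunksize):
        n_rows += len(chunk)
        src_hosts.update(chunk['srccomp'].unique())
        last_time = int(chunk['time'].iloc[-1])
    return n_rows, src_hosts, last_time


# Tiny stand-in for the real file (rows copied from the head output above).
demo = io.StringIO(
    "1,0,C1065,389,C3799,N10451,6,10,5323\n"
    "1,0,C1423,N1136,C1707,N1,6,5,847\n"
    "1,0,C1423,N1142,C1707,N1,6,5,847\n"
)
n_rows, src_hosts, last_time = summarize_in_chunks(demo, chunksize=2)
print(n_rows, sorted(src_hosts), last_time)
```

Only the running aggregates stay in memory, so the peak footprint depends on `chunksize` rather than on the file size.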

In [7]:
flows.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 9 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   time      uint32
 1   duration  uint16
 2   srccomp   object
 3   srcport   object
 4   dstcomp   object
 5   dstport   object
 6   proto     object
 7   pkts      int64 
 8   bytes     int64 
dtypes: int64(2), object(5), uint16(1), uint32(1)
memory usage: 591.3+ MB
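
The `+` in the reported memory usage means the figure is a lower bound: pandas did not measure the strings stored in the `object` columns. A minimal sketch of how the real footprint could be measured and reduced is shown below. The frame is an illustrative stand-in, not the actual sample, and the conversion to `category` is an assumption about what would help here, not a step taken in the analysis.

```python
import pandas as pd

# Small stand-in frame; column names follow the notebook, values are illustrative.
sample = pd.DataFrame({
    'srccomp': ['C1065', 'C1423', 'C1423', 'C1707'] * 1000,
    'dstcomp': ['C3799', 'C1707', 'C1707', 'C1423'] * 1000,
    'proto': ['6'] * 4000,
})

# deep=True also counts the Python string objects that the "+" leaves out.
before = sample.memory_usage(deep=True).sum()

# Columns with few distinct values are stored compactly as categoricals.
compact = sample.astype('category')
after = compact.memory_usage(deep=True).sum()

print(before, after)
```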


In [8]:
flows.head(10)

   time  duration  srccomp  srcport  dstcomp  dstport  proto  pkts  bytes
0     1         0    C1065      389    C3799   N10451      6    10   5323
1     1         0    C1423    N1136    C1707       N1      6     5    847
2     1         0    C1423    N1142    C1707       N1      6     5    847
3     1         0   C14909    N8191    C5720     2049      6     1     52
4     1         0   C14909    N8192    C5720     2049      6     1     52
5     1         0   C14909    N8193    C5720     2049      6     1     52
6     1         0    C1707       N1    C1423    N1136      6     4    414
7     1         0    C1707       N1    C1423    N1142      6     4    413
8     1         0    C1707       N1     C925   N10487      6     4    414
9     1         0    C1707       N1     C925   N10491      6     4    413


The dataset is evidently heavily anonymized: hosts and many ports appear only as opaque identifiers, which makes it rather unsuitable for purely flow-based attack detection. The flow records also carry no labels.

In [9]:
# Number of unique source and destination hosts
flows['srccomp'].nunique(), flows['dstcomp'].nunique()

(7180, 6076)

In [10]:
flows['dstport'].value_counts()

445       1603313
80         359297
389        347118
137        216539
88         210137
           ...   
N48300          1
N48402          1
N48406          1
N48405          1
N48977          1
Name: dstport, Length: 49107, dtype: int64
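
The port counts above mix plain numeric ports with `N`-prefixed tokens. The sketch below separates the two groups, assuming that the `N` prefix marks an anonymized port and that plain digits are real port numbers. That reading is inferred from the output, not from the dataset's documentation. The series is a stand-in built from the ten rows shown earlier.

```python
import pandas as pd

# Destination ports copied from the head() output above.
dstport = pd.Series(['N10451', 'N1', 'N1', '2049', '2049', '2049',
                     'N1136', 'N1142', 'N10487', 'N10491'], name='dstport')

# Assumption: an "N" prefix marks an anonymized port.
is_anonymized = dstport.str.startswith('N')

# Share of anonymized entries, then counts for the preserved ports only.
print(is_anonymized.mean())
print(dstport[~is_anonymized].value_counts())
```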

## Timing and Continuity Analysis

In [11]:
flows.iloc[-1]

time        100787
duration         0
srccomp     C12649
srcport      N2354
dstcomp      C5721
dstport        445
proto            6
pkts             1
bytes           46
Name: 9999999, dtype: object

So the first 10M samples (about 8% of the data) cover 100787 seconds (~28 hours). We can now check whether this block of data is continuous.

In [12]:
flows_sorted = flows.sort_values(by='time', ascending=False)

In [13]:
flows_sorted['tstamp'] = pd.to_datetime(flows_sorted['time'], unit='s')

In [14]:
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Measure the capture time covered by timestamps sorted in descending order.

    Returns (total duration excluding gaps, contiguity flag, durations of the
    continuous runs that were ended by a gap, longest first).
    """
    max_gap = pd.Timedelta(seconds=gap_max_secs)
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True

    # Walk the timestamps from newest to oldest and look for gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= max_gap:
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # A gap ends the current continuous run; record it and start a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations


In [15]:
measure_real_capture_dur(flows_sorted['tstamp'], 1008)

(Timedelta('1 days 03:59:46'), True, [])
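
The loop above visits every timestamp in Python, which is slow for millions of rows. A vectorized variant is sketched below as an alternative, not as the method used in this notebook. `find_gaps` is a hypothetical helper that works on the raw integer `time` column and reports the lengths of the gaps rather than the durations of the continuous runs.

```python
import numpy as np
import pandas as pd


def find_gaps(times, gap_max_secs=300):
    """Return (covered seconds, contiguous flag, gap lengths) for integer timestamps."""
    # Cast to int64 so that differences of unsigned values cannot wrap around.
    ordered = np.sort(np.asarray(times, dtype=np.int64))
    deltas = np.diff(ordered)                  # seconds between consecutive records
    gaps = deltas[deltas > gap_max_secs]       # differences treated as capture gaps
    covered = int(deltas.sum() - gaps.sum())   # time spanned, excluding the gaps
    return covered, gaps.size == 0, gaps.tolist()


# Illustrative timestamps: a dense run, a 1000 s hole, then another record.
demo_times = pd.Series([1, 1, 2, 5, 1005, 1006])
print(find_gaps(demo_times, gap_max_secs=300))
```

On the sample, a call such as `find_gaps(flows['time'], 1008)` would apply the same threshold as the cell above.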

Based on this sample, which contains no gap longer than the 1008-second threshold, we consider the dataset continuous. We also analyzed the whole dataset from the command line.
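
The command-line analysis itself is not shown here. One way to run a comparable check on the whole file from Python is sketched below. It assumes the rows are ordered by ascending `time`, as the sample suggests. `max_gap_streaming` is a hypothetical helper, and the in-memory buffer stands in for `data/kent2016/flows.txt`.

```python
import io

import pandas as pd

COLUMNS = ['time', 'duration', 'srccomp', 'srcport', 'dstcomp',
           'dstport', 'proto', 'pkts', 'bytes']


def max_gap_streaming(source, chunksize=1_000_000):
    """Return the largest time difference between consecutive rows of a flow file."""
    prev_time = None
    max_gap = 0
    # usecols keeps only the time column, so each chunk stays small.
    for chunk in pd.read_csv(source, names=COLUMNS, usecols=['time'],
                             chunksize=chunksize):
        times = chunk['time'].to_numpy(dtype='int64')
        if prev_time is not None:
            # Difference across the chunk boundary.
            max_gap = max(max_gap, int(times[0] - prev_time))
        if len(times) > 1:
            max_gap = max(max_gap, int((times[1:] - times[:-1]).max()))
        prev_time = int(times[-1])
    return max_gap


# Illustrative rows: only the time column varies; the other fields are placeholders.
demo = io.StringIO(''.join(f"{t},0,C1,N1,C2,N2,6,1,46\n" for t in [1, 1, 2, 400, 401]))
largest_gap = max_gap_streaming(demo, chunksize=2)
print(largest_gap)
```

If the largest difference stays below the chosen threshold, the full capture can be treated as continuous in the same sense as the sample.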