# VHS-22 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)

In [2]:
DATA_PATH = '/data/disk2/vhs-22/vhs-22-network-traffic-dataset.csv'

In [3]:
DATA_DTYPES = {
    'entry_id' : np.uint32,
    'ack_nr_count' : np.uint32,
    'b_packet_first_q' : np.int32,
    'b_packet_max' : np.uint32,
    'b_packet_mean' : np.float32,
    'b_packet_median' : np.float32,
    'b_packet_min' : np.uint32,
    'b_packet_std' : np.float32,
    'b_packet_third_q' : np.int32,
    'b_packet_total' : np.uint32,
    'dport': np.int32,
    'duration' : np.float32,
    'fin_nr_count' : np.int32,
    'first_timestamp' : str,
    'flags_sum' : np.int32,
    'iat_first_q' : np.int32,
    'iat_max' : np.float32,
    'iat_mean' : np.float32,
    'iat_median' : np.float32,
    'iat_min' : np.float32,
    'iat_std' : np.float32,
    'iat_third_q' : np.int32,
    'iat_var' : np.float32,
    'in_packets' : np.uint32,
    'ip_dst_str' : str,
    'ip_protocol' : np.int32,
    'ip_src_str' : str,
    'last_timestamp' : str,
    'psh_nr_count' : np.uint16,
    'rst_nr_count' : np.uint16,
    'sport' : np.int32,
    'syn_nr_count' : np.uint16,
    'urg_nr_count' : np.uint16,
    'connections_from_this_host' : np.uint32,
    'connections_to_this_host' : np.uint32,
    'connections_rst_to_this_host' : np.uint16,
    'connections_rst_from_this_host' : np.uint16,
    'connections_to_this_port' : np.uint32,
    'connections_from_this_port' : np.uint32,
    'connections_ratio_from_this_host' : np.float32,
    'connections_ratio_to_this_host' : np.float32,
    'connections_ratio_rst_to_this_host' : np.float32,
    'connections_ratio_rst_from_this_host' : np.float32,
    'connections_ratio_to_this_port' : np.float32,
    'connections_ratio_from_this_port' : np.float32,
    'label' : np.uint8,
    'attack_label' : str,
    'attack_file' : str,
}

In [4]:
# Load the data; the file uses a semicolon (";") as the separator, so it must be specified explicitly
data = pd.read_csv(DATA_PATH, sep=';', low_memory=True, dtype=DATA_DTYPES)#, nrows=15000000)
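If the full file does not fit into memory, a simple statistic such as the label distribution can also be gathered by reading the CSV in chunks. The following is a minimal sketch under that assumption, not part of the original analysis; the helper name `count_labels_chunked` and the three in-memory sample rows are hypothetical and stand in for the real file.

```python
import io
from collections import Counter

import pandas as pd


def count_labels_chunked(source, chunksize=1_000_000, column="attack_label"):
    """Tally the values of one column of a ';'-separated CSV, chunk by chunk."""
    counts = Counter()
    reader = pd.read_csv(
        source,
        sep=";",               # same separator as the dataset file
        usecols=[column],      # read only the column being counted
        dtype={column: str},
        chunksize=chunksize,   # makes read_csv return an iterator of DataFrames
    )
    for chunk in reader:
        counts.update(chunk[column].value_counts().to_dict())
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)


# Hypothetical in-memory sample instead of the real dataset file
sample = io.StringIO("dport;attack_label\n80;NORMAL\n53;NORMAL\n80;DOS\n")
label_counts = count_labels_chunked(sample, chunksize=2)
print(label_counts)
```

For the real data, `source` would be `DATA_PATH`; only one chunk of one column is held in memory at a time.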

In [5]:
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27735675 entries, 0 to 27735674
Data columns (total 48 columns):
 #   Column                                Non-Null Count     Dtype  
---  ------                                --------------     -----  
 0   Unnamed: 0                            27735675 non-null  int64  
 1   ack_nr_count                          27735675 non-null  uint32 
 2   b_packet_first_q                      27735675 non-null  int32  
 3   b_packet_max                          27735675 non-null  uint32 
 4   b_packet_mean                         27735675 non-null  float32
 5   b_packet_median                       27735675 non-null  float32
 6   b_packet_min                          27735675 non-null  uint32 
 7   b_packet_std                          27735675 non-null  float32
 8   b_packet_third_q                      27735675 non-null  int32  
 9   b_packet_total                        27735675 non-null  uint32 
 10  dport                                 27

In [6]:
len(data)

27735675

In [7]:
data.head(5)

Unnamed: 0.1,Unnamed: 0,ack_nr_count,b_packet_first_q,b_packet_max,b_packet_mean,b_packet_median,b_packet_min,b_packet_std,b_packet_third_q,b_packet_total,dport,duration,fin_nr_count,first_timestamp,flags_sum,iat_first_q,iat_max,iat_mean,iat_median,iat_min,iat_std,iat_third_q,iat_var,in_packets,ip_dst_str,ip_protocol,ip_src_str,last_timestamp,psh_nr_count,rst_nr_count,sport,syn_nr_count,urg_nr_count,connections_from_this_host,connections_to_this_host,connections_rst_to_this_host,connections_rst_from_this_host,connections_to_this_port,connections_from_this_port,connections_ratio_from_this_host,connections_ratio_to_this_host,connections_ratio_rst_to_this_host,connections_ratio_rst_from_this_host,connections_ratio_to_this_port,connections_ratio_from_this_port,label,attack_label,attack_file
0,7661877,0,91,91,91.0,91.0,91,0.0,91,364,427,9.002522,0,2022-01-01 00:00:00.002448,0,0,3.001299,3.000841,3.001299,3.00055,0.000328,0,1.076142e-07,4,239.255.255.253,17,131.243.61.70,2022-01-01 00:00:09.004970,0,0,59666,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,NORMAL,isot
1,7564618,0,88,88,88.0,88.0,88,0.0,88,88,53,0.0,0,2022-01-01 00:00:00.016580,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,128.3.97.204,17,128.3.164.249,2022-01-01 00:00:00.016580,0,0,45308,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,NORMAL,isot
2,7364701,1,66,66,66.0,66.0,66,0.0,66,66,5834,0.0,0,2022-01-01 00:00:00.024558,16,0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,128.3.23.41,6,220.195.34.143,2022-01-01 00:00:00.024558,0,0,80,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,NORMAL,isot
3,6641093,4,78,258,117.0,258.0,66,81.553665,78,468,5836,0.083575,1,2022-01-01 00:00:00.025059,17,0,0.082334,0.027858,0.000125,0.000125,0.038522,0,0.001483965,4,128.3.23.41,6,220.195.34.143,2022-01-01 00:00:00.108634,1,0,80,1,0,14,15,0,0,0,31,0.112903,0.120968,0.0,0.0,0.0,0.25,0,NORMAL,isot
4,6692585,1,66,66,66.0,66.0,66,0.0,66,66,5835,0.0,0,2022-01-01 00:00:00.026183,16,0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,128.3.23.41,6,220.195.34.143,2022-01-01 00:00:00.026183,0,0,80,0,0,28,31,0,0,0,114,0.055888,0.061876,0.0,0.0,0.0,0.227545,0,NORMAL,isot


In [8]:
data.tail(5)

Unnamed: 0.1,Unnamed: 0,ack_nr_count,b_packet_first_q,b_packet_max,b_packet_mean,b_packet_median,b_packet_min,b_packet_std,b_packet_third_q,b_packet_total,dport,duration,fin_nr_count,first_timestamp,flags_sum,iat_first_q,iat_max,iat_mean,iat_median,iat_min,iat_std,iat_third_q,iat_var,in_packets,ip_dst_str,ip_protocol,ip_src_str,last_timestamp,psh_nr_count,rst_nr_count,sport,syn_nr_count,urg_nr_count,connections_from_this_host,connections_to_this_host,connections_rst_to_this_host,connections_rst_from_this_host,connections_to_this_port,connections_from_this_port,connections_ratio_from_this_host,connections_ratio_to_this_host,connections_ratio_rst_to_this_host,connections_ratio_rst_from_this_host,connections_ratio_to_this_port,connections_ratio_from_this_port,label,attack_label,attack_file
27735670,23053261,0,60,60,60.0,60.0,60,0.0,60,180,10304,174.668427,0,2022-01-23 23:59:59.937674,0,0,173.429153,87.334213,0.0,1.239273,86.09494,0,7412.338867,3,46.52.123.7,17,147.32.84.229,2022-01-23 00:02:54.606101,0,0,13363,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0,NORMAL,capture20110815-3.truncated.pcap.txt
27735671,23266861,0,85,85,85.0,85.0,85,0.0,85,85,53,0.0,0,2022-01-23 23:59:59.962724,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,147.32.80.9,17,147.32.84.2,2022-01-23 23:59:59.962724,0,0,59002,0,0,0,1,0,0,1,0,0.0,0.25,0.0,0.0,0.25,0.0,0,NORMAL,capture20110815-3.truncated.pcap.txt
27735672,22823001,0,237,237,237.0,237.0,237,0.0,237,237,59002,0.0,0,2022-01-23 23:59:59.963025,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,147.32.84.2,17,147.32.80.9,2022-01-23 23:59:59.963025,0,0,53,0,0,84,4,0,0,0,84,0.167665,0.007984,0.0,0.0,0.0,0.167665,0,NORMAL,capture20110815-3.truncated.pcap.txt
27735673,22175202,21,74,451,83.86364,66.21891,66,80.133011,74,1845,80,0.879941,1,2022-01-23 23:59:59.980219,17,0,0.176273,0.041902,0.000225,3e-06,0.074785,0,0.005593,22,147.32.84.4,6,67.195.111.244,2022-01-23 00:00:00.860160,1,0,55843,1,0,1,1,0,0,30,0,0.009524,0.009524,0.0,0.0,0.285714,0.0,0,NORMAL,capture20110815-3.truncated.pcap.txt
27735674,22175201,28,74,1514,1288.928589,1485.137817,66,479.86084,74,36090,55843,0.704036,1,2022-01-23 23:59:59.980491,17,0,0.176208,0.026075,5.2e-05,2e-06,0.062432,0,0.003898,28,67.195.111.244,6,147.32.84.4,2022-01-23 00:00:00.684527,5,0,80,1,0,1,1,0,0,0,29,0.009346,0.009346,0.0,0.0,0.0,0.271028,0,NORMAL,capture20110815-3.truncated.pcap.txt


In [9]:
data.describe()

Unnamed: 0.1,Unnamed: 0,ack_nr_count,b_packet_first_q,b_packet_max,b_packet_mean,b_packet_median,b_packet_min,b_packet_std,b_packet_third_q,b_packet_total,dport,duration,fin_nr_count,flags_sum,iat_first_q,iat_max,iat_mean,iat_median,iat_min,iat_std,iat_third_q,iat_var,in_packets,ip_protocol,psh_nr_count,rst_nr_count,sport,syn_nr_count,urg_nr_count,connections_from_this_host,connections_to_this_host,connections_rst_to_this_host,connections_rst_from_this_host,connections_to_this_port,connections_from_this_port,connections_ratio_from_this_host,connections_ratio_to_this_host,connections_ratio_rst_to_this_host,connections_ratio_rst_from_this_host,connections_ratio_to_this_port,connections_ratio_from_this_port,label
count,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735675.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735675.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735675.0,27735680.0,27735680.0,27735680.0,27735680.0,27735680.0,27735675.0,27735680.0,27735680.0,27735680.0
mean,13867840.0,20.26148,300.7267,519.8169,370.7956,348.0421,289.4888,84.98149,300.7267,19248.26,23749.33,4925.9,0.1945019,6.498711,0.0,4016.045,1437.203,157.0034,952.0309,844.0299,0.0,886489900.0,25.99812,12.81092,2.708522,0.07808802,17431.95,0.2833092,0.003160046,29.5472,90.1625,23.12753,0.0,27.76191,80.04166,0.07614268,0.2304044,0.05039991,0.0,0.07335164,0.2097918,0.266634
std,8006600.0,3997.504,490.3435,852.1417,528.9402,607.6129,485.712,294.4873,490.3435,3121975.0,22556.69,138961.6,0.3958168,12.7938,0.0,124184.5,59836.6,8455.82,56499.32,29762.09,0.0,62552000000.0,4044.3,5.341487,165.1269,0.2683101,22391.48,0.9504746,0.7052764,81.03529,172.3558,126.0707,0.0,71.66904,160.5717,0.1756657,0.3606826,0.2566366,0.0,0.1630096,0.3423469,0.4421994
min,0.0,0.0,44.0,44.0,44.0,0.0,44.0,0.0,44.0,44.0,0.0,0.0,0.0,0.0,0.0,-0.001678944,-1.774153,-47.72279,-3613.304,0.0,0.0,0.0,1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6933918.0,0.0,62.0,72.0,68.0625,60.0,60.0,0.0,62.0,81.0,443.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,0.0,53.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,13867840.0,0.0,75.0,108.0,103.0,77.0,74.0,0.0,75.0,242.0,13363.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,17.0,0.0,0.0,2513.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.007984032,0.0,0.0,0.0,0.0,0.0
75%,20801760.0,1.0,144.0,699.0,316.2222,213.0,133.0,6.0,144.0,1514.0,46978.0,5.442859,0.0,17.0,0.0,3.909783,0.7698053,0.0002153939,0.0001571178,0.07970088,0.0,0.00635223,5.0,17.0,0.0,0.0,38439.0,0.0,0.0,10.0,65.0,0.0,0.0,1.0,58.0,0.04379562,0.3133733,0.0,0.0,0.01596806,0.3033932,1.0
max,27735670.0,11663880.0,4434.0,24874.0,5995.2,11698.0,2131.0,6947.739,4434.0,4283533000.0,65535.0,8241753.0,1.0,255.0,0.0,8235610.0,8235610.0,6323398.0,8235610.0,4113667.0,0.0,16922250000000.0,11663880.0,17.0,65310.0,1.0,65535.0,3190.0,3190.0,500.0,500.0,1000.0,0.0,500.0,500.0,1.0,1.0,2.0,0.0,1.0,1.0,1.0


In [10]:
# Label distribution
data['attack_label'].value_counts()

attack_label
NORMAL                                                                         20340401
DOS                                                                             6778273
BOTNET                                                                           260448
CTU_BOTNET                                                                       196600
STORM_P2P_BOTNET                                                                  63552
MedusaHTTP-malware-traffic                                                        16324
Urnsif-infection-with-Trickbot                                                     6625
Emotet-infection-with-spambot-traffic-and-Trickbot                                 5205
Emotet-epoch-3-with-Trickbot-gtag-mor60-infection-traffic                          4973
WEB_ATTACKS                                                                        4006
FTP_BRUTE_FORCE                                                                    4002
Emotet-infection-wi

In [11]:
# Number of distinct attack_label values (the NORMAL class is included in the count)
len(data['attack_label'].value_counts())

116

In [12]:
# Binary labels distribution
data['label'].value_counts()

label
0    20340401
1     7395274
Name: count, dtype: int64
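The two counts above translate into an attack share of roughly 26.7 %, which agrees with the mean of the `label` column reported by `describe()`. A small sketch of that arithmetic, with the counts copied by hand from the output above (the variable names are illustrative only):

```python
import pandas as pd

# Counts copied from the value_counts() output above
label_counts = pd.Series({0: 20340401, 1: 7395274}, name="count")

# Relative share of each binary class
label_share = label_counts / label_counts.sum()
attack_share = label_share.loc[1]

# Number of benign records per attack record
imbalance_ratio = label_counts.loc[0] / label_counts.loc[1]

print(label_share.round(4))
print(round(imbalance_ratio, 2))
```

On the loaded frame, `data['label'].value_counts(normalize=True)` gives the same shares directly.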

In [13]:
# Show the source files from which the labeled records were extracted
data['attack_file'].value_counts()

attack_file
anon-Booter6.pcap.txt                                                          3624984
capture20110810.truncated.pcap.txt                                             3567642
capture20110817.truncated.pcap.txt                                             2616069
capture20110811.truncated.pcap.txt                                             2319717
isot                                                                           2316923
capture20110815-3.truncated.pcap.txt                                           2200862
anon-Booter5.pcap.txt                                                          2040341
capture20110815.truncated.pcap.txt                                             1668547
cic_dos                                                                        1094098
cic_botnet                                                                     1040711
capture20110818.truncated.pcap.txt                                              921024
capture20110816.truncated.pcap.

In [14]:
data['dport'].value_counts()

dport
80       2996316
13363    2312440
53       2193210
443      1090007
12114     275897
          ...   
288           77
27            76
486           75
160           73
242           73
Name: count, Length: 65536, dtype: int64

## Temporal Analysis

In [15]:
# What do the timestamps look like?
data['last_timestamp']

0           2022-01-01 00:00:09.004970
1           2022-01-01 00:00:00.016580
2           2022-01-01 00:00:00.024558
3           2022-01-01 00:00:00.108634
4           2022-01-01 00:00:00.026183
                       ...            
27735670    2022-01-23 00:02:54.606101
27735671    2022-01-23 23:59:59.962724
27735672    2022-01-23 23:59:59.963025
27735673    2022-01-23 00:00:00.860160
27735674    2022-01-23 00:00:00.684527
Name: last_timestamp, Length: 27735675, dtype: object

In [16]:
# Seems like these can be easily converted
timestamps = pd.to_datetime(data['last_timestamp'])
timestamps = timestamps.sort_values(ascending=False)
timestamps

27297360   2022-01-23 23:59:59.997396
27297359   2022-01-23 23:59:59.997034
26862254   2022-01-23 23:59:59.996723
27735672   2022-01-23 23:59:59.963025
27735671   2022-01-23 23:59:59.962724
                      ...            
7          2022-01-01 00:00:00.036817
6          2022-01-01 00:00:00.036442
4          2022-01-01 00:00:00.026183
2          2022-01-01 00:00:00.024558
1          2022-01-01 00:00:00.016580
Name: last_timestamp, Length: 27735675, dtype: datetime64[ns]

In [17]:
# Compute the capture span
capture_span = timestamps.iloc[0] - timestamps.iloc[-1]
capture_span

Timedelta('22 days 23:59:59.980816')

In [18]:
# Largest allowed gap if we tolerate 1 % of the capture span: about 5.5 hours, which is a lot
capture_span * 0.01

Timedelta('0 days 05:31:11.999808160')

OK, so the capture spans roughly 23 days from the first to the last timestamp.
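The span itself does not require a full sort; the difference between the maximum and minimum timestamp is enough. A minimal sketch, using the first and last values from the sorted output above plus one made-up timestamp in between (the `ts`, `span`, and `gap_limit` names are illustrative):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2022-01-01 00:00:00.016580",   # earliest value in the sorted output
    "2022-01-10 12:00:00.000000",   # made-up value, for illustration only
    "2022-01-23 23:59:59.997396",   # latest value in the sorted output
]))

# Capture span without sorting the whole Series
span = ts.max() - ts.min()

# 1 % of the span, used as the maximum tolerated gap
gap_limit = span * 0.01

print(span)
print(gap_limit)
```

Sorting is still needed for the gap analysis that follows, since it compares neighboring timestamps.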

In [19]:
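# Measure capture duration and continuity
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Measures the capture duration while ignoring gaps longer than gap_max_secs.

    Expects a Series of timestamp objects sorted in descending order. Returns a tuple of
    (total duration, whether the capture is contiguous, continuous block durations sorted descending).
    """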
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the dataframe to find out gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= pd.Timedelta(seconds=gap_max_secs):
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)

            contiguous = False

        last_tstamp = cur_tstamp

    # Log the final continuous block and sort them via their length
    cont_durations.append(current_dur)
    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations

In [20]:
measure_real_capture_dur(timestamps, int(capture_span.total_seconds() * 0.01))

(Timedelta('5 days 05:27:07.864942'),
 False,
 [Timedelta('1 days 23:42:14.485329'),
  Timedelta('0 days 08:46:19.524909'),
  Timedelta('0 days 08:27:31.826894'),
  Timedelta('0 days 08:06:21.056063'),
  Timedelta('0 days 08:04:26.795344'),
  Timedelta('0 days 08:02:06.223602'),
  Timedelta('0 days 07:36:00.782501'),
  Timedelta('0 days 06:07:14.325575'),
  Timedelta('0 days 05:11:12.124123'),
  Timedelta('0 days 04:11:35.559359'),
  Timedelta('0 days 04:11:13.504965'),
  Timedelta('0 days 02:26:42.017690'),
  Timedelta('0 days 02:09:09.867811'),
  Timedelta('0 days 01:45:02.156494'),
  Timedelta('0 days 01:13:32.337453'),
  Timedelta('0 days 00:30:05.883894'),
  Timedelta('0 days 00:21:17.435626'),
  Timedelta('0 days 00:16:11.258150'),
  Timedelta('0 days 00:06:18.546551'),
  Timedelta('0 days 00:04:49.285091'),
  Timedelta('0 days 00:02:49.230085'),
  Timedelta('0 days 00:02:34.886034'),
  Timedelta('0 days 00:02:18.751399')])

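As seen, even though the traffic was replayed with a pre-set time, the continuous blocks are rather short (the longest is just under 2 days) and are separated by gaps; together they cover only about 5 days and 5 hours of the 23-day span. What happens if we decrease the maximum gap from roughly 5.5 hours to 1 hour?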

In [21]:
measure_real_capture_dur(timestamps, 3600)

(Timedelta('5 days 05:27:07.864942'),
 False,
 [Timedelta('1 days 23:42:14.485329'),
  Timedelta('0 days 08:46:19.524909'),
  Timedelta('0 days 08:27:31.826894'),
  Timedelta('0 days 08:06:21.056063'),
  Timedelta('0 days 08:04:26.795344'),
  Timedelta('0 days 08:02:06.223602'),
  Timedelta('0 days 07:36:00.782501'),
  Timedelta('0 days 06:07:14.325575'),
  Timedelta('0 days 05:11:12.124123'),
  Timedelta('0 days 04:11:35.559359'),
  Timedelta('0 days 04:11:13.504965'),
  Timedelta('0 days 02:26:42.017690'),
  Timedelta('0 days 02:09:09.867811'),
  Timedelta('0 days 01:45:02.156494'),
  Timedelta('0 days 01:13:32.337453'),
  Timedelta('0 days 00:30:05.883894'),
  Timedelta('0 days 00:21:17.435626'),
  Timedelta('0 days 00:16:11.258150'),
  Timedelta('0 days 00:06:18.546551'),
  Timedelta('0 days 00:04:49.285091'),
  Timedelta('0 days 00:02:49.230085'),
  Timedelta('0 days 00:02:34.886034'),
  Timedelta('0 days 00:02:18.751399')])

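OK, so the result is identical: no gap falls between 1 hour and roughly 5.5 hours, and the total duration stays the same. Gaps inside the continuous blocks are therefore no bigger than 1 hour, while the blocks themselves are separated by much longer gaps. The data thus cannot be considered continuous.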
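The gap measurement iterates over all 27.7 million timestamps in a Python loop. The same segmentation can be expressed with NumPy array operations, which is usually much faster. The sketch below is an assumed alternative, not the function used above; the name `measure_capture_blocks` and the five sample timestamps are hypothetical, and the input is assumed to contain no missing values.

```python
import numpy as np
import pandas as pd


def measure_capture_blocks(timestamps: pd.Series, gap_max_secs: int = 300):
    """Vectorized gap analysis; the input does not need to be pre-sorted."""
    # Timestamps as integer nanoseconds, sorted ascending
    ns = np.sort(timestamps.to_numpy(dtype="datetime64[ns]").astype("int64"))
    gaps = np.diff(ns)                                # gap between neighbors
    is_break = gaps > gap_max_secs * 1_000_000_000    # gaps that split blocks
    within = np.where(is_break, 0, gaps)              # breaking gaps count as 0
    block_id = np.cumsum(is_break)                    # block index of every gap

    # Sum the non-breaking gaps per block (exact integer arithmetic)
    block_ns = np.zeros(int(is_break.sum()) + 1, dtype=np.int64)
    np.add.at(block_ns, block_id, within)

    total = pd.Timedelta(int(within.sum()), unit="ns")
    blocks = sorted((pd.Timedelta(int(v), unit="ns") for v in block_ns), reverse=True)
    return total, not bool(is_break.any()), blocks


# Hypothetical sample: two blocks separated by a gap of 5 h 40 min
sample = pd.to_datetime(pd.Series([
    "2022-01-01 00:00:00", "2022-01-01 00:10:00", "2022-01-01 00:20:00",
    "2022-01-01 06:00:00", "2022-01-01 06:05:00",
]))
total, contiguous, blocks = measure_capture_blocks(sample, gap_max_secs=3600)
print(total, contiguous, blocks)
```

With a one-hour limit the sample splits into a 20-minute and a 5-minute block, and the 5 h 40 min gap is excluded from the total, mirroring the return structure of `measure_real_capture_dur`.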