# Kitsune Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

Source: https://www.kaggle.com/datasets/ymirsky/network-attack-dataset-kitsune

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
DATAPATH_ROOT = '/data/surv/kitsune'

# Sample
DATAPATH_SAMPLE_SYNDOS = os.path.join(DATAPATH_ROOT, 'syn_dos')
SAMPLE_PCAP   = os.path.join(DATAPATH_SAMPLE_SYNDOS, 'SYN_DoS_pcap.pcap')
SAMPLE_CSV    = os.path.join(DATAPATH_SAMPLE_SYNDOS, 'SYN_DoS_dataset.csv')
SAMPLE_LABELS = os.path.join(DATAPATH_SAMPLE_SYNDOS, 'SYN_DoS_labels.csv')

In [3]:
# Inspect the sample PCAP to get an idea of what the captures look like
!capinfos $SAMPLE_PCAP

File name:           /data/surv/kitsune/syn_dos/SYN_DoS_pcap.pcap
File type:           Wireshark/... - pcapng
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: (not set)
Packet size limit:   inferred: 200 bytes
Number of packets:   2,771 k
File size:           522 MB
Data size:           2,229 MB
Capture duration:    3169.220773 seconds
First packet time:   2017-02-25 09:28:32.825815
Last packet time:    2017-02-25 10:21:22.046588
Data byte rate:      703 kBps
Data bit rate:       5,628 kbps
Average packet size: 804.56 bytes
Average packet rate: 874 packets/s
SHA256:              98a988f50e5e1a25c43e6275e3fe5ef682737252705315c1ae185735cd19eb1e
RIPEMD160:           1bc43c8c2e27ed85f42680874719be92cb759f57
SHA1:                c1700c7d91d9ebece1dbb35a737c5b5da77a6432
Strict time order:   False
Capture application: Editcap 2.4.1
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - ether

In [4]:
# Print unique source IP addresses (hosts in the network)
!tshark -r $SAMPLE_PCAP -T fields -e ip.src | sort | uniq | sort -n


0.0.0.0
192.168.100.5
192.168.3.1
192.168.3.107
192.168.3.11
192.168.3.1,192.168.3.12
192.168.3.1,192.168.3.13
192.168.3.1,192.168.3.14
192.168.3.12
192.168.3.13
192.168.3.14


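Excluding the unspecified address 0.0.0.0, there are essentially 7 distinct hosts, which corresponds to a small network. The comma-separated entries come from packets for which tshark reported two `ip.src` values, so they add no new hosts. According to the paper, some experiments were carried out in a different network environment, but that environment also consists of only a few hosts.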
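The host count can also be reproduced programmatically. The sketch below is a minimal, hypothetical helper (`unique_hosts` is not part of the original notebook) that takes the lines printed by the `tshark` command, splits comma-joined values, and drops the unspecified address `0.0.0.0`. It assumes every non-empty field is a valid IP address.

```python
import ipaddress

def unique_hosts(lines, ignore=("0.0.0.0",)):
    """Return the sorted set of IP addresses found in tshark field output."""
    hosts = set()
    for line in lines:
        # tshark joins multiple occurrences of a field with commas
        for field in line.strip().split(","):
            if field and field not in ignore:
                hosts.add(ipaddress.ip_address(field))
    return sorted(hosts)

# Lines copied from the tshark output above; the empty entry is assumed
# to stand for packets without an IP source address
tshark_lines = [
    "",
    "0.0.0.0",
    "192.168.100.5",
    "192.168.3.1",
    "192.168.3.107",
    "192.168.3.11",
    "192.168.3.1,192.168.3.12",
    "192.168.3.1,192.168.3.13",
    "192.168.3.1,192.168.3.14",
    "192.168.3.12",
    "192.168.3.13",
    "192.168.3.14",
]

hosts = unique_hosts(tshark_lines)
print(len(hosts), [str(h) for h in hosts])
```

Sorting `ipaddress` objects orders the addresses numerically, which avoids the lexicographic ordering seen in the shell output.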

In [5]:
!head -n 10 $SAMPLE_CSV

1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,1.450000000000000000e+03,0.000000000000000000e+00,

In [24]:
# Count the comma delimiters in the first row; the number of features is this count + 1
!head -n 1 $SAMPLE_CSV | tr -cd ',' | wc -c

114
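The same count can be obtained in Python. This is a minimal sketch with a hypothetical helper, `count_fields`, that parses one CSV row and returns the number of fields, i.e. the number of delimiters plus one. The synthetic row merely stands in for the first line of the feature file.

```python
import csv

def count_fields(row_text):
    """Number of comma-separated fields in a single CSV row."""
    return len(next(csv.reader([row_text])))

# Synthetic stand-in for one row: 115 values joined by 114 commas
row = ",".join(["0.0"] * 115)

print(row.count(","), count_fields(row))
```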


In [6]:
!head -n 10 $SAMPLE_LABELS

"","x"
"1",0
"2",0
"3",0
"4",0
"5",0
"6",0
"7",0
"8",0
"9",0


Each feature vector should correspond to one packet, and each packet to one label. Let us verify whether this claim from the paper holds.

In [7]:
!tshark -r $SAMPLE_PCAP | wc -l

2771276


In [8]:
!cat $SAMPLE_CSV | wc -l

2771276


In [9]:
!cat $SAMPLE_LABELS | wc -l

2771276


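The line counts match for this sample, so the assumption appears to hold. Note that the label file starts with a header row (`"","x"`), which `wc -l` includes in its count, so the agreement may be off by one. Since the CSVs provide custom feature vectors, we will focus on the PCAPs and label files from here on.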
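A check of this kind can also be scripted. The sketch below assumes the packet count is already known (for example from `capinfos`) and that the label file has exactly one header row. `count_lines` and `counts_agree` are hypothetical helpers, and the temporary files only demonstrate the call.

```python
import os
import tempfile

def count_lines(path, chunk_size=1 << 20):
    """Count lines in a file; a final line without a newline also counts."""
    lines = 0
    last = b""
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            lines += chunk.count(b"\n")
            last = chunk[-1:]
    if last and last != b"\n":
        lines += 1
    return lines

def counts_agree(n_packets, csv_path, labels_path, label_header_rows=1):
    """True if packets, feature vectors, and labels are equally many."""
    n_vectors = count_lines(csv_path)
    n_labels = count_lines(labels_path) - label_header_rows
    return n_packets == n_vectors == n_labels

# Demonstration on tiny synthetic files (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "features.csv")
    labels_path = os.path.join(tmp, "labels.csv")
    with open(csv_path, "w") as fh:
        fh.write("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
    with open(labels_path, "w") as fh:
        fh.write('"","x"\n"1",0\n"2",0\n"3",1\n')
    ok = counts_agree(3, csv_path, labels_path)
    mismatch = counts_agree(4, csv_path, labels_path)

print(ok, mismatch)
```

Subtracting the header row explicitly makes an off-by-one difference between labels and packets visible instead of hiding it in the raw line count.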

# PCAP Analysis

In [10]:
from pathlib import Path

pcap_paths = []

for path in Path(DATAPATH_ROOT).rglob('*.pcap*'):
    pcap_paths.append(path.absolute())

In [11]:
for pcap_file in pcap_paths:
    !capinfos -a -e $pcap_file
    print()

File name:           /data/surv/kitsune/active_wiretap/Active_Wiretap_pcap.pcapng
Packet size limit:   inferred: 200 bytes
First packet time:   2017-08-09 10:56:53.367032
Last packet time:    2017-08-09 11:18:49.604291

File name:           /data/surv/kitsune/ssdp_flood/SSDP_Flood_pcap.pcap
Packet size limit:   inferred: 200 bytes
First packet time:   2017-02-25 10:46:10.018869
Last packet time:    2017-02-25 11:26:54.140941

File name:           /data/surv/kitsune/syn_dos/SYN_DoS_pcap.pcap
Packet size limit:   inferred: 200 bytes
First packet time:   2017-02-25 09:28:32.825815
Last packet time:    2017-02-25 10:21:22.046588

File name:           /data/surv/kitsune/ssl_renegotiation/SSL_Renegotiation_pcap.pcap
Packet size limit:   inferred: 200 bytes
First packet time:   2017-05-04 21:47:37.653090
Last packet time:    2017-05-04 22:26:21.717726

File name:           /data/surv/kitsune/mirai_botnet/Mirai_pcap.pcap
Packet size limit:   inferred: 200 bytes
First packet time:   2018-10-25 

Since the captures do not overlap, we can compute the total capture duration directly by summing the per-file durations.
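To make the non-overlap argument explicit, the sketch below checks a list of (start, end) intervals for overlaps and sums their durations. It uses only the four captures whose timestamps are fully visible in the output above. `parse`, `overlaps`, and `total_seconds` are hypothetical helpers, not the contents of `countdur.sh`.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# First/last packet times copied from the capinfos output above
captures = {
    "Active_Wiretap": ("2017-08-09 10:56:53.367032", "2017-08-09 11:18:49.604291"),
    "SSDP_Flood": ("2017-02-25 10:46:10.018869", "2017-02-25 11:26:54.140941"),
    "SYN_DoS": ("2017-02-25 09:28:32.825815", "2017-02-25 10:21:22.046588"),
    "SSL_Renegotiation": ("2017-05-04 21:47:37.653090", "2017-05-04 22:26:21.717726"),
}

def parse(interval):
    """Convert a (start, end) pair of strings into datetime objects."""
    start, end = interval
    return datetime.strptime(start, FMT), datetime.strptime(end, FMT)

def overlaps(intervals):
    """True if any two intervals overlap (adjacent check after sorting by start)."""
    ordered = sorted(intervals)
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

def total_seconds(intervals):
    """Sum of the individual interval lengths in seconds."""
    return sum((end - start).total_seconds() for start, end in intervals)

intervals = [parse(v) for v in captures.values()]
print(overlaps(intervals), round(total_seconds(intervals), 6))
```

Summing per-file durations is only meaningful when `overlaps` returns `False`; otherwise shared time would be counted twice.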

In [12]:
!/data/AAAA_SCRIPTS/countpkts.sh $DATAPATH_ROOT

Processing /data/surv/kitsune/active_wiretap/Active_Wiretap_pcap.pcapng


Processing /data/surv/kitsune/arp_mitm/ARP_MitM_pcap.pcapng
Processing /data/surv/kitsune/fuzzing/Fuzzing_pcap.pcapng
Processing /data/surv/kitsune/mirai_botnet/Mirai_pcap.pcap
Processing /data/surv/kitsune/os_scan/OS_Scan_pcap.pcapng
Processing /data/surv/kitsune/ssdp_flood/SSDP_Flood_pcap.pcap
Processing /data/surv/kitsune/ssl_renegotiation/SSL_Renegotiation_pcap.pcap
Processing /data/surv/kitsune/syn_dos/SYN_DoS_pcap.pcap
Processing /data/surv/kitsune/video_injection/Video_Injection_pcap.pcapng

TOTAL FILES : 9
TOTAL PACKETS: 21017597


In [13]:
# Determine the total capture duration.
# The timestamps show that the captures are not contiguous.
!/data/AAAA_SCRIPTS/countdur.sh $DATAPATH_ROOT

24596.141711


In [27]:
from datetime import datetime, timedelta

def convert_time(sec: int):
    sec = timedelta(seconds=sec)
    d = datetime(1,1,1) + sec

    print("%dd %dh %dm %ds" % (d.day-1, d.hour, d.minute, d.second))

convert_time(24596)

0d 6h 49m 56s


# CSV Analysis

Investigate the labels, since the CSV feature files themselves are of little interest here.

In [14]:
# Identify the label files and collect the label column from each
from pathlib import Path

label_frames = []

for path in Path(DATAPATH_ROOT).rglob('*label*.csv'):
    print(path)
    labels = pd.read_csv(path)

    # The Mirai label file names its label column '0' instead of 'x'
    if path.name == 'mirai_labels.csv':
        labels = labels.rename(columns={'0': 'x'})

    label_frames.append(labels[['x']])

# Concatenate once instead of growing a DataFrame inside the loop
labels_all = pd.concat(label_frames, ignore_index=True)

/data/surv/kitsune/active_wiretap/Active_Wiretap_labels.csv


/data/surv/kitsune/ssdp_flood/SSDP_Flood_labels.csv
/data/surv/kitsune/syn_dos/SYN_DoS_labels.csv
/data/surv/kitsune/ssl_renegotiation/SSL_Renegotiation_labels.csv
/data/surv/kitsune/mirai_botnet/mirai_labels.csv
/data/surv/kitsune/arp_mitm/ARP_MitM_labels.csv
/data/surv/kitsune/video_injection/Video_Injection_labels.csv
/data/surv/kitsune/os_scan/OS_Scan_labels.csv
/data/surv/kitsune/fuzzing/Fuzzing_labels.csv


In [15]:
labels_all

Unnamed: 0,x
0,0
1,0
2,0
3,0
4,0
...,...
21017591,0
21017592,0
21017593,0
21017594,0


Apparently, one label row is missing compared with the 21,017,597 packets counted in the PCAPs. This could be investigated to find the exact file and packet index, but we do not pursue it in this work.
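One way to carry out such an investigation would be to compare packet and label counts file by file. The sketch below is hypothetical: `find_mismatches` is not part of the notebook, and the counts in the example are invented placeholders rather than measurements from the dataset.

```python
def find_mismatches(packet_counts, label_counts):
    """Return {capture: (packets, labels)} for captures whose counts differ."""
    mismatches = {}
    for name in sorted(set(packet_counts) | set(label_counts)):
        packets = packet_counts.get(name)
        labels = label_counts.get(name)
        if packets != labels:
            mismatches[name] = (packets, labels)
    return mismatches

# Invented placeholder counts, for illustration only
packet_counts = {"capture_a": 1000, "capture_b": 2500, "capture_c": 400}
label_counts = {"capture_a": 1000, "capture_b": 2499, "capture_c": 400}

print(find_mismatches(packet_counts, label_counts))
```

In practice, the per-capture packet counts could come from `capinfos` and the label counts from the length of each loaded label file.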

In [16]:
# Show the label distribution
labels_all.value_counts()

x
0    16166316
1     4851280
Name: count, dtype: int64

In [17]:
labels_all.value_counts().iloc[1] / len(labels_all)

0.23081992821633834

Around 23% of the packets are labeled as malicious.
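The arithmetic behind this figure, using the counts printed above, can be spelled out as follows. The variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
label_counts = {0: 16166316, 1: 4851280}

total_labels = sum(label_counts.values())
malicious_share = label_counts[1] / total_labels

print(total_labels, f"{malicious_share:.2%}")
```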