# Edge-IIoTset Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

In [37]:
DATA_ROOT = '/data/edge_iiot/edge_iiot_dataset'
DATA_ATTACK = os.path.join(DATA_ROOT, 'Attack traffic')
DATA_NORMAL = os.path.join(DATA_ROOT, 'Normal traffic')
DATA_SELECTION_MIX = os.path.join(DATA_ROOT, 'Selected dataset for ML and DL')

DURATION_GAP_TOLERANCE = 120

In [3]:
def find_files(path: str, ext: str) -> list:
    found_files = []
    for root, _, files in os.walk(path):
        found_files += [os.path.join(root, desired_file) for desired_file in files if desired_file.endswith(ext)]

    return found_files

## PCAP Analysis

In [4]:
# Count the number of packets
!/data/AAAA_SCRIPTS/countpkts.sh $DATA_ROOT

Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Backdoor_attack.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS HTTP Flood Attacks.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS ICMP Flood Attacks.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS TCP SYN Flood Attacks.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS UDP Flood Attacks.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/MITM (ARP spoofing + DNS) Attack.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/OS Fingerprinting attack.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Password attacks.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Port Scanning attack.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Ransomware attack.pcap
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/SQL injection attack.pcap
Processing /data/edge_iiot/edge_iiot

In [5]:
# Count the number of text file entries for attack data and normal data separately
# The counts need to be split; otherwise the script would also count the data prepared for ML/DL, causing duplicates
!/data/AAAA_SCRIPTS/countflows.sh "$DATA_ATTACK"

Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Backdoor_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS_ICMP_Flood_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Port_Scanning_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Uploading_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/XSS_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Vulnerability_scanner_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS_UDP_Flood_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Ransomware_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/OS_Fingerprinting_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS_HTTP_Flood_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/SQL_injection_attack.csv
Processing /data/edge_iiot/edge_iiot_dataset/Attack traffic/Password_a

In [6]:
!/data/AAAA_SCRIPTS/countflows.sh "$DATA_NORMAL"

Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Temperature_and_Humidity/Temperature_and_Humidity.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Distance/Distance.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Modbus/Modbus.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Heart_Rate/Heart_Rate.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Water_Level/Water_Level.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/phValue/phValue.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Sound_Sensor/Sound_Sensor.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Soil_Moisture/Soil_Moisture.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/Flame_Sensor/Flame_Sensor.csv
Processing /data/edge_iiot/edge_iiot_dataset/Normal traffic/IR_Receiver/IR_Receiver.csv
TOTAL FILES: 10
TOTAL FLOWS: 11209913


In [7]:
9729709 + 11209913

20939622

In [8]:
# Finally, determine the duration of the capture
# According to the documentation, the captures should be independent and non-overlapping
!capinfos "$DATA_ATTACK/DDoS UDP Flood Attacks.pcap"

File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/DDoS UDP Flood Attacks.pcap
File type:           Wireshark/tcpdump/... - pcap
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: 262144 bytes
Number of packets:   3,215 k
File size:           247 MB
Data size:           196 MB
Capture duration:    400.908273 seconds
First packet time:   2021-11-24 23:27:10.139612
Last packet time:    2021-11-24 23:33:51.047885
Data byte rate:      489 kBps
Data bit rate:       3,913 kbps
Average packet size: 60.98 bytes
Average packet rate: 8,021 packets/s
SHA256:              6de0341f27093cac8a5eba657b8a49ecbac4dd5bf6c452a6274472df69172ce5
RIPEMD160:           3c69a3b6cee2d598dc54bca6217d1eef2a72a106
SHA1:                8270de8742425d33d0b781594d9fc735680829ed
Strict time order:   False
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - ether)
                     Capture length

In [26]:
# This file is broken; try to repair it with pcapfix (the repaired output overwrites the original file)
!pcapfix "$DATA_ROOT/Attack traffic/Vulnerability scanner attack.pcap" -o "$DATA_ROOT/Attack traffic/Vulnerability scanner attack.pcap"

pcapfix 1.1.7 (c) 2012-2021 Robert Krause

[*] Reading from file: /data/edge_iiot/edge_iiot_dataset/Attack traffic/Vulnerability scanner attack.pcap
[*] Writing to file: /data/edge_iiot/edge_iiot_dataset/Attack traffic/Vulnerability scanner attack.pcap
[*] File size: 140377652 bytes.
[+] This is a PCAP file.
[*] Analyzing Global Header...
[+] The global pcap header seems to be fine!
[*] Analyzing packets...
[*] Progress:  20.00 %
[*] Progress:  40.00 %
[*] Progress:  60.00 %
[*] Progress:  80.14 %
[*] End of file reached. Aligning last packet.
[-] Cannot align last packet, because it is broken.
[*] Progress: 100.00 %
[*] Wrote 265827 packets to file.
[+] SUCCESS: 1 Corruption(s) fixed!



In [9]:
!/data/AAAA_SCRIPTS/countdur.sh $DATA_ROOT

4786872.048212


This is far more than the documentation suggests: roughly 4.79 million seconds is over 50 days, so some PCAPs must contain huge gaps.
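As a quick sanity check of the figure above, the reported total can be converted to days. This is a minimal sketch; `reported_total_sec` is simply the value printed by `countdur.sh`, copied by hand, and the variable names are introduced here for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Value printed by countdur.sh above (copied manually)
reported_total_sec = 4786872.048212

reported_total_days = reported_total_sec / SECONDS_PER_DAY
print(f"{reported_total_days:.1f} days")  # prints "55.4 days"
```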

In [10]:
files_attack_pcap = find_files(DATA_ATTACK, '.pcap')
files_normal_pcap = find_files(DATA_NORMAL, '.pcap')

In [11]:
for file in files_attack_pcap:
    !capinfos -a -e "$file"
    print()

File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/Ransomware attack.pcap
First packet time:   2021-12-02 16:02:07.998055
Last packet time:    2021-12-04 00:24:38.441645



File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/XSS attacks.pcap
First packet time:   2021-11-29 17:44:47.438207
Last packet time:    2021-11-29 19:53:14.731065

File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/Backdoor_attack.pcap
First packet time:   2021-12-05 17:58:23.150209
Last packet time:    2021-12-05 21:07:01.118175

File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/SQL injection attack.pcap
First packet time:   2021-11-29 21:35:18.238171
Last packet time:    2021-11-29 21:43:43.688520

File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/Vulnerability scanner attack.pcap
First packet time:   2021-12-06 19:22:08.593687
Last packet time:    2021-12-06 20:02:03.254705

File name:           /data/edge_iiot/edge_iiot_dataset/Attack traffic/Port Scanning attack.pcap
First packet time:   2021-11-24 16:03:29.475073
Last packet time:    2021-11-24 18:50:09.194016

File name:           /data/edge_iiot/edg

In [12]:
for file in files_normal_pcap:
    !capinfos -a -e "$file"
    print()

File name:           /data/edge_iiot/edge_iiot_dataset/Normal traffic/Temperature_and_Humidity/Temperature_and_Humidity.pcap
First packet time:   2021-11-21 11:44:10.081753
Last packet time:    2021-12-20 01:40:55.043322

File name:           /data/edge_iiot/edge_iiot_dataset/Normal traffic/Distance/Distance.pcap
First packet time:   2021-12-27 23:58:21.314757
Last packet time:    2021-12-28 00:25:13.352243

File name:           /data/edge_iiot/edge_iiot_dataset/Normal traffic/Modbus/Modbus.pcap
First packet time:   2021-12-20 01:22:47.981993
Last packet time:    2022-01-05 07:26:07.966722

File name:           /data/edge_iiot/edge_iiot_dataset/Normal traffic/Heart_Rate/Heart_Rate.pcap
First packet time:   2021-12-28 02:39:39.986982
Last packet time:    2021-12-28 03:10:17.376946

File name:           /data/edge_iiot/edge_iiot_dataset/Normal traffic/Water_Level/Water_Level.pcap
First packet time:   2021-12-28 04:48:17.156663
Last packet time:    2022-01-04 23:09:14.060928

File name:  

While the attack scenarios have fairly limited durations, some normal captures stretch over several days or even weeks. For our purposes, the difference between the first and last packet times therefore misrepresents the capture duration. Let's compute the duration by parsing the PCAP contents and ignoring inter-packet gaps longer than `DURATION_GAP_TOLERANCE` seconds.
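Before parsing packets, it may help to quantify how stretched the captures are. The sketch below computes wall-clock spans from a few first/last packet times copied from the `capinfos` listings above; `capture_bounds` and `spans` are names introduced here for illustration, and only a hand-picked subset of files is included.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# (first packet time, last packet time), copied from the capinfos output above
capture_bounds = {
    "Ransomware attack.pcap": ("2021-12-02 16:02:07.998055", "2021-12-04 00:24:38.441645"),
    "Temperature_and_Humidity.pcap": ("2021-11-21 11:44:10.081753", "2021-12-20 01:40:55.043322"),
    "Distance.pcap": ("2021-12-27 23:58:21.314757", "2021-12-28 00:25:13.352243"),
    "Modbus.pcap": ("2021-12-20 01:22:47.981993", "2022-01-05 07:26:07.966722"),
}

# Wall-clock span = last timestamp minus first timestamp (gaps included)
spans = {}
for name, (first, last) in capture_bounds.items():
    spans[name] = datetime.strptime(last, TIME_FORMAT) - datetime.strptime(first, TIME_FORMAT)
    print(f"{name}: {spans[name]}")
```

The temperature and humidity capture alone spans about four weeks of wall-clock time, while the distance capture spans under half an hour, which is why summing first-to-last spans inflates the total.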

In [28]:
from scapy.all import PcapReader


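def compute_pcap_true_duration(filepath: str, max_gap_sec: int) -> tuple:
    """Sum inter-packet times in a PCAP, skipping gaps longer than max_gap_sec.

    Returns (duration_in_seconds, contiguous), where contiguous is False
    if at least one gap exceeded max_gap_sec.
    """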
    contiguous = True
    total_dur = 0
    last_time = 0

    for pkt in PcapReader(filepath):
        pkt_time = pkt.time

        # First packet in the file
        if last_time == 0:
            last_time = pkt_time

        # Check if the specified gap has not been reached
        if last_time + max_gap_sec >= pkt_time:
            total_dur += pkt_time - last_time
        elif contiguous:
            contiguous = False

        last_time = pkt_time

    return float(total_dur), contiguous
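To illustrate the gap-tolerant logic without reading a PCAP, here is a minimal pure-Python restatement that operates on a plain list of timestamps. The function name `gap_tolerant_duration` and the example timestamps are hypothetical and serve only to show the behavior; this is a sketch of the same idea, not a replacement for the scapy-based function.

```python
def gap_tolerant_duration(timestamps, max_gap_sec):
    """Sum consecutive timestamp differences that do not exceed max_gap_sec."""
    total = 0.0
    contiguous = True
    previous = None

    for ts in timestamps:
        if previous is not None:
            gap = ts - previous
            if gap <= max_gap_sec:
                total += gap        # ordinary inter-packet time, counts toward the duration
            else:
                contiguous = False  # long pause: skip it and remember the capture had a gap
        previous = ts

    return total, contiguous


# Hypothetical timestamps in seconds: two short bursts separated by a one-hour pause
example_timestamps = [0.0, 10.0, 20.0, 3620.0, 3630.0]
duration, contiguous = gap_tolerant_duration(example_timestamps, max_gap_sec=120)
print(duration, contiguous)  # prints "30.0 False"
```

The naive first-to-last span of this example would be 3630 seconds, whereas the gap-tolerant duration counts only the 30 seconds of actual activity.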

In [40]:
# Supposing the files are not overlapping, determine the overall timespan of the capture
total_duration_all_files = 0

for file in files_attack_pcap + files_normal_pcap:
    current_file_duration, _ = compute_pcap_true_duration(file, DURATION_GAP_TOLERANCE)
    total_duration_all_files += current_file_duration

total_duration_all_files

168312.78576800006

In [51]:
from datetime import datetime, timedelta

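def convert_time(sec: float):
    # Split a timedelta directly; adding it to datetime(1,1,1) would misreport durations of a month or longer
    td = timedelta(seconds=sec)
    hours, rem = divmod(td.seconds, 3600)
    minutes, seconds = divmod(rem, 60)

    print(f"{td.days}d {hours}h {minutes}m {seconds}s")

convert_time(total_duration_all_files)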

1d 22h 45m 12s


## CSV Analysis -- Merged Files

We omit the analysis of the individual files, since we have already analyzed the PCAPs. At this point, let's look at only one file -- the merged file for deep learning, which should be a subset of the file for ML.

In [41]:
data = pd.read_csv(os.path.join(DATA_SELECTION_MIX, 'DNN-EdgeIIoT-dataset.csv'))



In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219201 entries, 0 to 2219200
Data columns (total 63 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   frame.time                 object 
 1   ip.src_host                object 
 2   ip.dst_host                object 
 3   arp.dst.proto_ipv4         object 
 4   arp.opcode                 float64
 5   arp.hw.size                float64
 6   arp.src.proto_ipv4         object 
 7   icmp.checksum              float64
 8   icmp.seq_le                float64
 9   icmp.transmit_timestamp    float64
 10  icmp.unused                float64
 11  http.file_data             object 
 12  http.content_length        float64
 13  http.request.uri.query     object 
 14  http.request.method        object 
 15  http.referer               object 
 16  http.request.full_uri      object 
 17  http.request.version       object 
 18  http.response              float64
 19  http.tls_port              float64
 20  tc

In [52]:
len(data)

2219201

In [53]:
data['Attack_type'].value_counts()

Attack_type
Normal                   1615643
DDoS_UDP                  121568
DDoS_ICMP                 116436
SQL_injection              51203
Password                   50153
Vulnerability_scanner      50110
DDoS_TCP                   50062
DDoS_HTTP                  49911
Uploading                  37634
Backdoor                   24862
Port_Scanning              22564
XSS                        15915
Ransomware                 10925
MITM                        1214
Fingerprinting              1001
Name: count, dtype: int64

In [55]:
len(data['Attack_type'].value_counts())

15

In [54]:
data['Attack_label'].value_counts()

Attack_label
0    1615643
1     603558
Name: count, dtype: int64
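The two label columns can be cross-checked against each other using only the counts printed above. The sketch below copies those counts by hand (the dictionary names are introduced here for illustration) and verifies that the `Normal` rows match label `0`, that all attack types sum to label `1`, and that everything adds up to the row count.

```python
# Counts copied from the value_counts() outputs above
attack_type_counts = {
    "Normal": 1615643,
    "DDoS_UDP": 121568,
    "DDoS_ICMP": 116436,
    "SQL_injection": 51203,
    "Password": 50153,
    "Vulnerability_scanner": 50110,
    "DDoS_TCP": 50062,
    "DDoS_HTTP": 49911,
    "Uploading": 37634,
    "Backdoor": 24862,
    "Port_Scanning": 22564,
    "XSS": 15915,
    "Ransomware": 10925,
    "MITM": 1214,
    "Fingerprinting": 1001,
}
attack_label_counts = {0: 1615643, 1: 603558}
total_rows = 2219201  # len(data)

# Sum every class except "Normal" to get the number of attack rows
attack_rows = sum(n for name, n in attack_type_counts.items() if name != "Normal")

assert attack_type_counts["Normal"] == attack_label_counts[0]
assert attack_rows == attack_label_counts[1]
assert sum(attack_type_counts.values()) == total_rows

attack_share = attack_rows / total_rows
print(f"Attack share: {attack_share:.1%}")  # prints "Attack share: 27.2%"
```

Matching totals do not prove row-level agreement; on the loaded frame, something like `pd.crosstab(data['Attack_type'], data['Attack_label'])` would check that directly.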