# IoT-23 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
# Root directory of the unpacked data (small dataset; the big dataset with PCAPs was analyzed on the command line)
ROOT_DIR = '/data/disk2/iot23/opt/Malware-Project/BigDataset/IoTScenarios/'

## Preparation

In [3]:
# There are several IoT-23 capture scenarios
os.listdir(ROOT_DIR)

['CTU-Honeypot-Capture-5-1',
 'CTU-IoT-Malware-Capture-48-1',
 'CTU-IoT-Malware-Capture-39-1',
 'CTU-IoT-Malware-Capture-17-1',
 'CTU-IoT-Malware-Capture-43-1',
 'CTU-IoT-Malware-Capture-3-1',
 'CTU-IoT-Malware-Capture-21-1',
 'CTU-Honeypot-Capture-4-1',
 'CTU-IoT-Malware-Capture-60-1',
 'CTU-IoT-Malware-Capture-35-1',
 'CTU-IoT-Malware-Capture-9-1',
 'CTU-IoT-Malware-Capture-44-1',
 'CTU-IoT-Malware-Capture-8-1',
 'CTU-IoT-Malware-Capture-7-1',
 'CTU-IoT-Malware-Capture-33-1',
 'CTU-IoT-Malware-Capture-42-1',
 'CTU-IoT-Malware-Capture-49-1',
 'CTU-IoT-Malware-Capture-36-1',
 'CTU-IoT-Malware-Capture-20-1',
 'CTU-IoT-Malware-Capture-34-1',
 'CTU-IoT-Malware-Capture-1-1',
 'CTU-IoT-Malware-Capture-52-1',
 'CTU-Honeypot-Capture-7-1']

In [4]:
# Select one honeypot and one malware capture for further analysis
PATH_HONEYPOT = os.path.join(ROOT_DIR, 'CTU-Honeypot-Capture-4-1/bro/conn.log.labeled')
PATH_MALWARE  = os.path.join(ROOT_DIR, 'CTU-IoT-Malware-Capture-49-1/bro/conn.log.labeled')

In [5]:
# The data are in the Bro format. What do they look like?
!head -n 15 $PATH_HONEYPOT

#separator \x09
#set_separator	,
#empty_field	(empty)
#unset_field	-
#path	conn
#open	2019-12-05-15-46-36
#fields	ts	uid	id.orig_h	id.orig_p	id.resp_h	id.resp_p	proto	service	duration	orig_bytes	resp_bytes	conn_state	local_orig	local_resp	missed_bytes	history	orig_pkts	orig_ip_bytes	resp_pkts	resp_ip_bytes	tunnel_parents   label   detailed-label
#types	time	string	addr	port	addr	port	enum	string	interval	count	count	string	bool	bool	count	string	count	count	count	count	set[string]   string   string
1540469302.538640	CGm6jB4dXK71ZDWUDh	192.168.1.132	58687	216.239.35.4	123	udp	-	0.114184	48	48	SF	-	-	0	Dd	1	76	1	76	-   benign   -
1540469197.400159	CnaDAG3n5r8eiG4su2	192.168.1.132	1900	239.255.255.250	1900	udp	-	160.367579	7536	0	S0	-	-	0	D	24	8208	0	0	-   benign   -
1540469385.734089	CUrxU238nt0m6yTgKf	192.168.1.132	32893	216.239.35.8	123	udp	-	0.016986	48	48	SF	-	-	0	Dd	1	76	1	76	-   benign   -
1540469831.302625	CGQf8t1kjdxB5PHXL4	192.168.1.132	53395	2.16.60.82	443	tcp	-	0.003497	0	0	SF

Essentially, we need to set the separator to the tab character and strip all lines starting with the `#` character. The field names are copied from the `#fields` line and supplied manually during data loading.
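As an alternative to copying the names by hand, the field list could be read from the `#fields` line itself. The following is only a minimal sketch, assuming the header mixes tabs and runs of spaces as in the sample above; `read_bro_fields` is a hypothetical helper that is not part of the original analysis, and the sample header is abbreviated.

```python
import io


def read_bro_fields(lines):
    """Return the column names from the '#fields' header line, or an empty list."""
    for line in lines:
        if line.startswith('#fields'):
            # split() without arguments splits on any run of whitespace, so it
            # handles both the tabs and the space-separated label columns
            return line.split()[1:]
        if not line.startswith('#'):
            break  # The header is over and no '#fields' line was found
    return []


# Abbreviated header in the style of the sample above (hypothetical subset of columns)
sample = io.StringIO(
    '#separator \\x09\n'
    '#path\tconn\n'
    '#fields\tts\tuid\tid.orig_h\ttunnel_parents   label   detailed-label\n'
    '1540469302.538640\tCGm6jB4dXK71ZDWUDh\t192.168.1.132\t-   benign   -\n'
)
fields = read_bro_fields(sample)
print(fields)
```

The returned list could then be passed as `names=` when loading the data, in place of a hand-written list.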

In [6]:
IOT23_FEATURES = [
    'ts',
    'uid',
    'id.orig_h',
    'id.orig_p',
    'id.resp_h',
    'id.resp_p',
    'proto',
    'service',
    'duration',
    'orig_bytes',
    'resp_bytes',
    'conn_state',
    'local_orig',
    'local_resp',
    'missed_bytes',
    'history',
    'orig_pkts',
    'orig_ip_bytes',
    'resp_pkts',
    'resp_ip_bytes',
    'tunnel_parents',
    'label',
    'detailed-label',
]

In [7]:
# I also tried explicit dtypes, but numeric columns sometimes contain "-" (the unset-field marker) instead of a value
# The data would require more cleaning in order to apply correct datatypes - keeping as is for this analysis
# Note: float32 cannot resolve epoch timestamps and uint16 overflows for the byte counters, hence the 64-bit types
IOT23_TYPES = {
    'ts': np.float64,
    'uid': str,
    'id.orig_h': str,
    'id.orig_p': np.int32,
    'id.resp_h': str,
    'id.resp_p': np.int32,
    'proto': str,
    'service': str,
    'duration': np.float64,
    'orig_bytes': np.uint64,
    'resp_bytes': np.uint64,
    'conn_state': str,
    'local_orig': str,
    'local_resp': str,
    'missed_bytes': np.uint64,
    'history': str,
    'orig_pkts': np.uint64,
    'orig_ip_bytes': np.uint64,
    'resp_pkts': np.uint64,
    'resp_ip_bytes': np.uint64,
    'tunnel_parents': str,
    'label': str,
    'detailed-label': str,
}
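One possible way to apply explicit dtypes despite the `-` placeholders is to declare `-` as a missing value and use nullable dtypes. This is a hedged sketch rather than the cleaning we actually performed: it uses only a few of the columns, the middle row of the sample is made up, and the dtype mapping is an assumption for illustration.

```python
import io

import pandas as pd

# Three rows with a subset of the columns; the middle row is made up and has unset values
sample = io.StringIO(
    '0.114184\t48\t48\tSF\n'
    '-\t-\t-\tS0\n'
    '160.367579\t7536\t0\tS0\n'
)
columns = ['duration', 'orig_bytes', 'resp_bytes', 'conn_state']
dtypes = {
    'duration': 'float64',   # NaN marks an unset duration
    'orig_bytes': 'Int64',   # Nullable integer, <NA> marks an unset counter
    'resp_bytes': 'Int64',
    'conn_state': str,
}

# na_values='-' turns the unset-field marker into a missing value in every column
df = pd.read_csv(sample, sep='\t', names=columns, na_values='-', dtype=dtypes)
print(df.dtypes)
print(df)
```

Note that `na_values='-'` applies to all columns, so string columns such as `service` would also report `-` as missing rather than as a literal dash.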

In [8]:
# Therefore, in order to prepare a file, we need to strip all its metadata lines (starting with #)
# We further discovered that the label features (apparently added later, not by bro) are not delimited by a tab character, but by spaces
# This data quality is frustrating, but the spaces need to be converted to tabs before proceeding
import re

def prepare_iot23_file(file_orig, file_new) -> None:
    """Prepares IoT-23 files. However, the whole file needs to be loaded into RAM, which might be infeasible for larger captures.
    Adjust the code if needed."""
    new_contents = None
    old_contents = None

    # Read the original file
    with open(file_orig, 'r') as in_file:
        old_contents = in_file.readlines()

    # Prepare the file contents
    new_contents = ''.join([line for line in old_contents if not line.startswith('#')])
    new_contents = re.sub('[ ]{2,}', '\t', new_contents)    # Replace 2 and more spaces with tab character

    # Write to the new file
    with open(file_new, 'w') as out_file:
        out_file.write(new_contents)
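For captures that do not fit into RAM, the same preparation could be done line by line. This is a minimal sketch of such an adjustment, not the code used for the results in this notebook; `prepare_iot23_file_streaming` is a hypothetical name. Since a run of spaces never spans a newline, substituting per line gives the same result as substituting over the whole file.

```python
import os
import re
import tempfile

MULTI_SPACE = re.compile(r' {2,}')  # Two or more consecutive spaces


def prepare_iot23_file_streaming(file_orig, file_new) -> None:
    """Line-by-line variant: drops '#' metadata lines and turns runs of spaces into tabs."""
    with open(file_orig, 'r') as in_file, open(file_new, 'w') as out_file:
        for line in in_file:
            if line.startswith('#'):
                continue  # Skip metadata lines
            out_file.write(MULTI_SPACE.sub('\t', line))


# Small self-check on a temporary file with one metadata line and one data line
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'conn.log.labeled')
    dst = src + '_modif'
    with open(src, 'w') as f:
        f.write('#fields\tts\tuid\n')
        f.write('1540469302.538640\tCGm6jB4dXK71ZDWUDh\t-   benign   -\n')
    prepare_iot23_file_streaming(src, dst)
    with open(dst, 'r') as f:
        result = f.read()

print(repr(result))
```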

In [9]:
# Filenames for the updated files
PATH_HONEYPOT_MODIF = PATH_HONEYPOT + '_modif'
PATH_MALWARE_MODIF = PATH_MALWARE + '_modif'

# Prepare both honeypot and malware capture files
prepare_iot23_file(PATH_HONEYPOT, PATH_HONEYPOT_MODIF)
prepare_iot23_file(PATH_MALWARE, PATH_MALWARE_MODIF)

## Honeypot Capture

In [10]:
data_honeypot = pd.read_csv(PATH_HONEYPOT_MODIF, sep='\t', names=IOT23_FEATURES)

In [11]:
data_honeypot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   ts              452 non-null    float64
 1   uid             452 non-null    object 
 2   id.orig_h       452 non-null    object 
 3   id.orig_p       452 non-null    int64  
 4   id.resp_h       452 non-null    object 
 5   id.resp_p       452 non-null    int64  
 6   proto           452 non-null    object 
 7   service         452 non-null    object 
 8   duration        452 non-null    object 
 9   orig_bytes      452 non-null    object 
 10  resp_bytes      452 non-null    object 
 11  conn_state      452 non-null    object 
 12  local_orig      452 non-null    object 
 13  local_resp      452 non-null    object 
 14  missed_bytes    452 non-null    int64  
 15  history         452 non-null    object 
 16  orig_pkts       452 non-null    int64  
 17  orig_ip_bytes   452 non-null    int

In [12]:
len(data_honeypot)

452

In [13]:
data_honeypot.describe()

| | ts | id.orig_p | id.resp_p | missed_bytes | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes |
|---|---|---|---|---|---|---|---|---|
| count | 452.0 | 452.0 | 452.0 | 452.0 | 452.0 | 452.0 | 452.0 | 452.0 |
| mean | 1540508000.0 | 48306.522124 | 201.44469 | 45.232301 | 29.610619 | 7543.912 | 7.80531 | 581.85177 |
| std | 24356.46 | 11877.479603 | 327.329725 | 440.060252 | 412.030448 | 130951.1 | 125.368803 | 6097.843503 |
| min | 1540469000.0 | 68.0 | 53.0 | 0.0 | 1.0 | 62.0 | 0.0 | 0.0 |
| 25% | 1540485000.0 | 41905.0 | 53.0 | 0.0 | 1.0 | 65.0 | 1.0 | 76.0 |
| 50% | 1540508000.0 | 52594.0 | 80.0 | 0.0 | 1.0 | 76.0 | 1.0 | 144.0 |
| 75% | 1540530000.0 | 56663.25 | 123.0 | 0.0 | 4.0 | 172.0 | 2.0 | 325.0 |
| max | 1540548000.0 | 60977.0 | 1900.0 | 7363.0 | 8124.0 | 2778408.0 | 2667.0 | 129675.0 |


In [14]:
data_honeypot.head(10)

| | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | resp_bytes | conn_state | local_orig | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | tunnel_parents | label | detailed-label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1540469000.0 | CGm6jB4dXK71ZDWUDh | 192.168.1.132 | 58687 | 216.239.35.4 | 123 | udp | - | 0.114184 | 48 | 48 | SF | - | - | 0 | Dd | 1 | 76 | 1 | 76 | - | benign | - |
| 1 | 1540469000.0 | CnaDAG3n5r8eiG4su2 | 192.168.1.132 | 1900 | 239.255.255.250 | 1900 | udp | - | 160.367579 | 7536 | 0 | S0 | - | - | 0 | D | 24 | 8208 | 0 | 0 | - | benign | - |
| 2 | 1540469000.0 | CUrxU238nt0m6yTgKf | 192.168.1.132 | 32893 | 216.239.35.8 | 123 | udp | - | 0.016986 | 48 | 48 | SF | - | - | 0 | Dd | 1 | 76 | 1 | 76 | - | benign | - |
| 3 | 1540470000.0 | CGQf8t1kjdxB5PHXL4 | 192.168.1.132 | 53395 | 2.16.60.82 | 443 | tcp | - | 0.003497 | 0 | 0 | SF | - | - | 0 | ShAFf | 5 | 212 | 3 | 144 | - | benign | - |
| 4 | 1540470000.0 | CUo9DH2QDnCaBIGjkg | 192.168.1.132 | 52801 | 192.168.1.1 | 53 | udp | dns | 0.036724 | 34 | 311 | SF | - | - | 0 | Dd | 1 | 62 | 1 | 339 | - | benign | - |
| 5 | 1540469000.0 | CAvXOZ3htimWEtglii | 192.168.1.132 | 1900 | 239.255.255.250 | 1900 | udp | - | 384.518261 | 15072 | 0 | S0 | - | - | 0 | D | 48 | 16416 | 0 | 0 | - | benign | - |
| 6 | 1540470000.0 | CfJsUD2NGQvnK2p7Vd | 192.168.1.132 | 58124 | 216.239.35.12 | 123 | udp | - | 0.270332 | 48 | 48 | SF | - | - | 0 | Dd | 1 | 76 | 1 | 76 | - | benign | - |
| 7 | 1540470000.0 | CVwKZS98dRvk1jeH2 | 192.168.1.132 | 35313 | 216.239.35.0 | 123 | udp | - | 0.111429 | 48 | 48 | SF | - | - | 0 | Dd | 1 | 76 | 1 | 76 | - | benign | - |
| 8 | 1540470000.0 | CzbHG4aoHRooWvyMg | 192.168.1.132 | 46064 | 216.239.35.4 | 123 | udp | - | 0.114433 | 48 | 48 | SF | - | - | 0 | Dd | 1 | 76 | 1 | 76 | - | benign | - |
| 9 | 1540470000.0 | CbTB0B2ZnnDWLAIml3 | 192.168.1.132 | 45230 | 216.239.35.8 | 123 | udp | - | 0.016984 | 48 | 48 | SF | - | - | 0 | Dd | 1 | 76 | 1 | 76 | - | benign | - |


In [15]:
# Label distribution
data_honeypot['label'].value_counts()

label
benign    452
Name: count, dtype: int64

In [16]:
# Source IP distribution
data_honeypot['id.orig_h'].value_counts()

id.orig_h
192.168.1.132    452
Name: count, dtype: int64

In [17]:
# Protocol distribution
data_honeypot['proto'].value_counts()

proto
udp    310
tcp    142
Name: count, dtype: int64

In [18]:
# Service distribution
data_honeypot['service'].value_counts()

service
-       205
dns     191
http     54
dhcp      2
Name: count, dtype: int64

## Malware Capture

In [19]:
data_malware = pd.read_csv(PATH_MALWARE_MODIF, sep='\t', names=IOT23_FEATURES)


In [20]:
data_malware.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5410561 entries, 0 to 5410560
Data columns (total 23 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   ts              5410561 non-null  float64
 1   uid             5410561 non-null  object 
 2   id.orig_h       5410561 non-null  object 
 3   id.orig_p       5410561 non-null  int64  
 4   id.resp_h       5410561 non-null  object 
 5   id.resp_p       5410561 non-null  int64  
 6   proto           5410561 non-null  object 
 7   service         5410561 non-null  object 
 8   duration        5410561 non-null  object 
 9   orig_bytes      5410561 non-null  object 
 10  resp_bytes      5410561 non-null  object 
 11  conn_state      5410561 non-null  object 
 12  local_orig      5410561 non-null  object 
 13  local_resp      5410561 non-null  object 
 14  missed_bytes    5410561 non-null  int64  
 15  history         5410561 non-null  object 
 16  orig_pkts       5410561 non-null  in

In [21]:
len(data_malware)

5410561

In [22]:
data_malware.describe()

| | ts | id.orig_p | id.resp_p | missed_bytes | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes |
|---|---|---|---|---|---|---|---|---|
| count | 5410561.0 | 5410561.0 | 5410561.0 | 5410561.0 | 5410561.0 | 5410561.0 | 5410561.0 | 5410561.0 |
| mean | 1551397000.0 | 27347.0 | 8077.058 | 0.0 | 3.0026 | 120.1892 | 0.004027863 | 0.9855185 |
| std | 7611.845 | 3259.144 | 165.095 | 0.0 | 1.951891 | 84.09645 | 0.9249058 | 525.4039 |
| min | 1551383000.0 | 3.0 | 0.0 | 0.0 | 1.0 | 40.0 | 0.0 | 0.0 |
| 25% | 1551390000.0 | 24159.0 | 8081.0 | 0.0 | 2.0 | 80.0 | 0.0 | 0.0 |
| 50% | 1551397000.0 | 30535.0 | 8081.0 | 0.0 | 2.0 | 80.0 | 0.0 | 0.0 |
| 75% | 1551403000.0 | 30535.0 | 8081.0 | 0.0 | 4.0 | 160.0 | 0.0 | 0.0 |
| max | 1551410000.0 | 60700.0 | 30535.0 | 0.0 | 1796.0 | 93758.0 | 1792.0 | 434528.0 |


In [23]:
data_malware.head(10)

| | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | resp_bytes | conn_state | local_orig | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | tunnel_parents | label | detailed-label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1551383000.0 | Cnkx4U1xJgOIPooe2 | 192.168.1.193 | 51358 | 134.209.72.171 | 80 | tcp | http | 1.874424 | 149 | 198998 | SF | - | - | 0 | ShADadttfF | 234 | 12994 | 285 | 418624 | - | Malicious | FileDownload |
| 1 | 1551383000.0 | C5t6Kp1UnQyH7JCC45 | 192.168.1.193 | 51360 | 134.209.72.171 | 80 | tcp | http | 1.871487 | 151 | 198966 | SF | - | - | 0 | ShADadttfF | 282 | 15702 | 282 | 412612 | - | Malicious | FileDownload |
| 2 | 1551383000.0 | CRdXqi4p3HG1obs6xl | 192.168.1.193 | 51362 | 134.209.72.171 | 80 | tcp | http | 2.376374 | 148 | 140547 | SF | - | - | 0 | ShADadttfF | 174 | 9600 | 201 | 291562 | - | Malicious | FileDownload |
| 3 | 1551383000.0 | Co7Sfi3KfjPoJwW5Kf | 192.168.1.193 | 51364 | 134.209.72.171 | 80 | tcp | - | 3.092539 | 0 | 0 | S0 | - | - | 0 | S | 6 | 360 | 0 | 0 | - | Benign | - |
| 4 | 1551383000.0 | CKaKqD34bwvqLUcID9 | 192.168.1.193 | 51364 | 134.209.72.171 | 80 | tcp | http | 1.35978 | 148 | 137299 | SF | - | - | 0 | ShADadttfF | 178 | 9840 | 196 | 284806 | - | Malicious | FileDownload |
| 5 | 1551383000.0 | CFk7Y1SrQmKfgTlp2 | 192.168.1.193 | 51366 | 134.209.72.171 | 80 | tcp | http | 1.952813 | 151 | 209560 | SF | - | - | 0 | ShADadtftF | 220 | 11878 | 296 | 434528 | - | Malicious | FileDownload |
| 6 | 1551383000.0 | CyLuux1ZhzNIPptfna | 192.168.1.193 | 24159 | 108.116.222.190 | 8081 | tcp | - | 6e-06 | 0 | 0 | S0 | - | - | 0 | S | 2 | 80 | 0 | 0 | - | Malicious | PartOfAHorizontalPortScan |
| 7 | 1551383000.0 | CA2IXnRpixOsx5fIh | 192.168.1.193 | 24159 | 197.142.48.202 | 8081 | tcp | - | 6e-06 | 0 | 0 | S0 | - | - | 0 | S | 2 | 80 | 0 | 0 | - | Malicious | PartOfAHorizontalPortScan |
| 8 | 1551383000.0 | C9MktJrfmw7GTH4yg | 192.168.1.193 | 24159 | 197.37.170.62 | 8081 | tcp | - | 2e-06 | 0 | 0 | S0 | - | - | 0 | S | 2 | 80 | 0 | 0 | - | Malicious | PartOfAHorizontalPortScan |
| 9 | 1551383000.0 | COqW6nwmLrjaVpPvc | 192.168.1.193 | 24159 | 217.222.4.197 | 8081 | tcp | - | 6e-06 | 0 | 0 | S0 | - | - | 0 | S | 2 | 80 | 0 | 0 | - | Malicious | PartOfAHorizontalPortScan |


In [24]:
# Label distribution
data_malware['label'].value_counts()

label
Malicious    5406896
Benign          3665
Name: count, dtype: int64
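To put these counts into perspective, the relative share of each label can be computed directly from them. The sketch below simply redoes the arithmetic on the two counts printed above; within pandas, `value_counts(normalize=True)` should give the same fractions.

```python
# Label counts copied from the output above
label_counts = {'Malicious': 5406896, 'Benign': 3665}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f'{label}: {share:.4%}')
```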

In [25]:
# Detailed label descriptions
data_malware['detailed-label'].value_counts()

detailed-label
PartOfAHorizontalPortScan    5404959
-                               3665
C&C                             1922
FileDownload                      14
C&C-FileDownload                   1
Name: count, dtype: int64

In [26]:
# Source IPs distribution
data_malware['id.orig_h'].value_counts()

id.orig_h
192.168.1.193      5408909
197.13.3.22             53
197.13.3.14             33
197.13.3.6              30
217.136.48.1            11
                    ...   
217.250.204.130          1
217.255.194.23           1
217.83.205.24            1
217.232.49.98            1
14.1.5.94                1
Name: count, Length: 1258, dtype: int64

In [27]:
# Destination ports distribution
data_malware['id.resp_p'].value_counts()

id.resp_p
8081     5406794
4554        1923
13           685
0            555
1            349
123          168
3             33
80            17
10            16
30535          7
24159          4
53             4
22             3
5353           2
67             1
Name: count, dtype: int64

In [28]:
# Services distribution
data_malware['service'].value_counts()

service
-       5410535
http         19
dns           6
dhcp          1
Name: count, dtype: int64

### Continuity Analysis

In [29]:
# Float-based timestamps need to be converted into pandas Timestamp objects
data_malware['tstamp'] = pd.to_datetime(data_malware['ts'], unit='s', origin='unix')
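As a quick sanity check of what such a conversion yields, a single epoch value can be converted with the standard library alone. This is only an illustrative sketch; the timestamp is taken from the first honeypot record shown earlier, and interpreting it as UTC is an assumption based on it being a Unix epoch value.

```python
from datetime import datetime, timezone

# Epoch timestamp of the first honeypot record (seconds since 1970-01-01, assumed UTC)
ts = 1540469302.538640

dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())
```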

In [30]:
# Sort the records by these timestamps in descending order (newest first)
data_malware_sorted = data_malware.sort_values(by='tstamp', ascending=False)

In [31]:
# Compute the span
span = data_malware_sorted.iloc[0]['tstamp'] - data_malware_sorted.iloc[-1]['tstamp']
span

Timedelta('0 days 07:19:46.802563191')

In [32]:
# Are there gaps in the data (was the capture interrupted)?
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Measures the capture duration while skipping gaps longer than gap_max_secs.

    Expects a series of timestamp objects sorted in descending order.
    Returns a tuple of (total duration without gaps, whether the capture is contiguous,
    durations of the contiguous blocks sorted from the longest to the shortest)."""
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True
    gap_max = pd.Timedelta(seconds=gap_max_secs)

    # Iterate through the timestamps to find gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= gap_max:
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # Gap found: close the current contiguous block and start a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    # Log the final contiguous block and sort the blocks by their length
    cont_durations.append(current_dur)
    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations

In [33]:
# Calculate time span and continuity of the given capture (a gap longer than 1% of the span counts as an interruption)
measure_real_capture_dur(data_malware_sorted['tstamp'], int(span.total_seconds() * 0.01))

(Timedelta('0 days 07:19:46.802563191'),
 True,
 [Timedelta('0 days 07:19:46.802563191')])

As is apparent, the individual capture files are continuous. However, the dataset as a whole consists of several discontinuous captures (based on our PCAP analysis), so we consider the dataset overall to be discontinuous.
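The same gap check can also be applied to timestamps pooled from several captures. Below is a minimal sketch on plain epoch seconds; `split_into_blocks` is a hypothetical helper and the toy timestamps are made up for illustration, they are not taken from the dataset.

```python
def split_into_blocks(timestamps, gap_max_secs):
    """Split epoch timestamps (in seconds) into contiguous blocks.

    A new block starts whenever two neighbouring timestamps are more than
    gap_max_secs apart. Returns a list of (start, end) tuples."""
    ordered = sorted(timestamps)
    if not ordered:
        return []

    blocks = []
    start = prev = ordered[0]
    for ts in ordered[1:]:
        if ts - prev > gap_max_secs:
            blocks.append((start, prev))  # Close the block before the gap
            start = ts
        prev = ts
    blocks.append((start, prev))          # Close the final block
    return blocks


# Made-up timestamps: two bursts of traffic roughly one hour apart
toy = [0.0, 10.0, 25.0, 3600.0, 3610.0]
blocks = split_into_blocks(toy, gap_max_secs=300)
print(blocks)            # [(0.0, 25.0), (3600.0, 3610.0)]
print(len(blocks) == 1)  # False, i.e. the pooled data are discontinuous
```

A single returned block corresponds to a continuous capture, whereas several blocks indicate interruptions longer than the chosen threshold.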