# X-IIoTID Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

In [2]:
DATAPATH = '/data/data_surv/x-iiotid/x_iiotid_dataset.csv'

In [3]:
data = pd.read_csv(DATAPATH)


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 820834 entries, 0 to 820833
Data columns (total 68 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   Date                         820503 non-null  object
 1   Timestamp                    820537 non-null  object
 2   Scr_IP                       820834 non-null  object
 3   Scr_port                     820834 non-null  object
 4   Des_IP                       820834 non-null  object
 5   Des_port                     820834 non-null  object
 6   Protocol                     820834 non-null  object
 7   Service                      820834 non-null  object
 8   Duration                     820834 non-null  object
 9   Scr_bytes                    820834 non-null  object
 10  Des_bytes                    820834 non-null  object
 11  Conn_state                   820834 non-null  int64 
 12  missed_bytes                 820834 non-null  object
 13  is_syn_only   

In [32]:
len(data)

820834

In [5]:
# Let's see what the Date and Timestamp columns look like
data[['Date', 'Timestamp']].head()

Unnamed: 0,Date,Timestamp
0,9/01/2020,1578540956
1,13/01/2020,1578871873
2,9/01/2020,1578522486
3,27/02/2020,1582757640
4,16/12/2019,1576452612


In this case, the `Date` column is essentially redundant: `Timestamp` already encodes the date and adds finer resolution (seconds).
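
As a quick check of that claim, the sketch below converts the sample timestamps shown above with `pd.to_datetime`. It assumes the values are Unix epoch seconds. The result is in UTC, so the calendar day can differ from the `Date` column if that column was recorded in a local timezone.

```python
import pandas as pd

# Sample values copied from the head() output above
sample = pd.Series(['1578540956', '1578871873', '1578522486'], name='Timestamp')

# The column is loaded as object (strings), so cast to integers first,
# then interpret the values as seconds since the Unix epoch (UTC)
converted = pd.to_datetime(sample.astype('int64'), unit='s')
print(converted)
```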

In [6]:
# Apparently, there are NaN values for times -- are they connected to any specific class?
data[data[['Date', 'Timestamp']].isna().any(axis=1)]['class1'].value_counts()

class1
crypto-ransomware    304
Fake_notification     28
Name: count, dtype: int64

In [7]:
data[data['Timestamp'] == '?']['class1'].value_counts()

class1
crypto-ransomware    154
Name: count, dtype: int64

In [8]:
# The placeholder "?" also appears in the columns; replace it with NaN to get a clearer picture of the situation
data = data.replace('?', np.nan)
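
An alternative, sketched here on a small in-memory CSV rather than the real file, is to declare `?` as a missing-value marker at load time through the `na_values` parameter of `pd.read_csv`. The rows below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample: one valid timestamp, one "?", one empty field
csv_text = (
    "Timestamp,class1\n"
    "1578540956,Normal\n"
    "?,crypto-ransomware\n"
    ",Fake_notification\n"
)

# na_values extends pandas' default list of strings that are parsed as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values=['?'])
print(toy['Timestamp'].isna().sum())
print(toy.dtypes)
```

With the marker handled during parsing, a column that otherwise contains only numbers is read as numeric instead of `object`.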

In [9]:
# Re-check after the replacement: are the missing time values tied to specific classes?
data[data[['Date', 'Timestamp']].isna().any(axis=1)]['class1'].value_counts()

class1
crypto-ransomware    458
Fake_notification     28
Name: count, dtype: int64

In [10]:
# Are there any other classes with NaN values then?
data[data.isna().any(axis=1)]['class1'].value_counts()

class1
crypto-ransomware    458
Fake_notification     28
Normal                 7
fuzzing                3
Name: count, dtype: int64

In [11]:
# Not that bad. What is the overall class distribution?
data['class1'].value_counts()

class1
Normal                            421417
RDOS                              141261
Scanning_vulnerability             52852
Generic_scanning                   50277
BruteForce                         47241
MQTT_cloud_broker_subscription     23524
Discovering_resources              23148
Exfiltration                       22134
insider_malcious                   17447
Modbus_register_reading             5953
False_data_injection                5094
C&C                                 2863
Dictionary                          2572
TCP Relay                           2119
fuzzing                             1313
Reverse_shell                       1016
crypto-ransomware                    458
MitM                                 117
Fake_notification                     28
Name: count, dtype: int64

In [12]:
# Express it as a fraction of each class (NaN means the class has no rows with missing values)
data[data.isna().any(axis=1)]['class1'].value_counts() / data['class1'].value_counts()

class1
BruteForce                             NaN
C&C                                    NaN
Dictionary                             NaN
Discovering_resources                  NaN
Exfiltration                           NaN
Fake_notification                 1.000000
False_data_injection                   NaN
Generic_scanning                       NaN
MQTT_cloud_broker_subscription         NaN
MitM                                   NaN
Modbus_register_reading                NaN
Normal                            0.000017
RDOS                                   NaN
Reverse_shell                          NaN
Scanning_vulnerability                 NaN
TCP Relay                              NaN
crypto-ransomware                 1.000000
fuzzing                           0.002285
insider_malcious                       NaN
Name: count, dtype: float64

So 100% of the crypto-ransomware and Fake_notification records contain NaN values; for the other classes, the share is negligible.
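
The NaN entries in the ratio above come from index alignment: classes with no incomplete rows are absent from the numerator. One way to get 0 for those classes instead, sketched here on a toy frame with made-up values, is to take the per-class mean of a boolean "row has NaN" flag.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), mimicking the Timestamp and class1 columns
toy = pd.DataFrame({
    'Timestamp': [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    'class1': ['Normal', 'crypto-ransomware', 'Normal',
               'crypto-ransomware', 'RDOS', 'Normal'],
})

# True for rows that contain at least one NaN
has_nan = toy.isna().any(axis=1)

# The mean of a boolean flag per class is the fraction of affected rows
nan_share = has_nan.groupby(toy['class1']).mean()
print(nan_share)
```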

In [13]:
data['Timestamp'] = data['Timestamp'].astype(np.float64)
data = data.sort_values(by='Timestamp', ascending=True)

In [14]:
data.head()

Unnamed: 0,Date,Timestamp,Scr_IP,Scr_port,Des_IP,Des_port,Protocol,Service,Duration,Scr_bytes,Des_bytes,Conn_state,missed_bytes,is_syn_only,Is_SYN_ACK,is_pure_ack,is_with_payload,FIN or RST,Bad_checksum,is_SYN_with_RST,Scr_pkts,Scr_ip_bytes,Des_pkts,Des_ip_bytes,anomaly_alert,total_bytes,total_packet,paket_rate,byte_rate,Scr_packts_ratio,Des_pkts_ratio,Scr_bytes_ratio,Des_bytes_ratio,Avg_user_time,Std_user_time,Avg_nice_time,Std_nice_time,Avg_system_time,Std_system_time,Avg_iowait_time,Std_iowait_time,Avg_ideal_time,Std_ideal_time,Avg_tps,Std_tps,Avg_rtps,Std_rtps,Avg_wtps,Std_wtps,Avg_ldavg_1,Std_ldavg_1,Avg_kbmemused,Std_kbmemused,Avg_num_Proc/s,Std_num_proc/s,Avg_num_cswch/s,std_num_cswch/s,OSSEC_alert,OSSEC_alert_level,Login_attempt,Succesful_login,File_activity,Process_activity,read_write_physical.process,is_privileged,class1,class2,class3
397665,7/03/2020,1557395000.0,192.168.2.10,50302,192.168.10.150,1883,tcp,mqtt,252.13105,1169,56,1,0,False,False,True,True,False,False,False,28,2625,14,784,False,4634,42,0.166580038,18.37933091,0.666666667,0.333333333,0.818731118,0.181268882,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,0,0,0,1,0,False_data_injection,Tampering,Attack
651139,7/03/2020,1557395000.0,192.168.2.10,50304,192.168.10.150,1883,tcp,mqtt,252.131227,847,56,1,0,False,False,True,True,False,False,False,28,2303,14,784,False,3990,42,0.166579921,15.82509254,0.666666667,0.333333333,0.789473684,0.210526316,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,0,0,0,1,0,False_data_injection,Tampering,Attack
439618,7/03/2020,1557395000.0,192.168.2.10,50316,192.168.10.150,1883,tcp,mqtt,-,-,-,1,0,False,False,True,True,False,False,False,-,-,-,-,False,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,0,0,0,1,0,False_data_injection,Tampering,Attack
394343,7/03/2020,1557395000.0,192.168.2.10,50314,192.168.10.150,1883,tcp,mqtt,-,-,-,1,0,False,False,True,True,False,False,False,-,-,-,-,False,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,0,0,0,1,0,False_data_injection,Tampering,Attack
258425,7/03/2020,1557395000.0,192.168.2.10,50306,192.168.10.150,1883,tcp,mqtt,-,-,-,1,0,False,False,True,True,False,False,False,-,-,-,-,False,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,0,0,0,0,0,0,1,0,False_data_injection,Tampering,Attack


In [28]:
data['Service'].value_counts()

Service
http                        237738
dns                         222713
coap                        167236
other                        83699
websocket                    58462
mqtt                         17121
https                         9517
modbus                        6525
smtp                          5322
ssh                           5040
private                       3607
simple_service_discovery      1645
netbios-ns                    1256
echo                           642
imap                           165
dhcp                            91
mysql                           55
Name: count, dtype: int64

In [29]:
# Let's see the classes and their distributions
data['class1'].value_counts()

class1
Normal                            421417
RDOS                              141261
Scanning_vulnerability             52852
Generic_scanning                   50277
BruteForce                         47241
MQTT_cloud_broker_subscription     23524
Discovering_resources              23148
Exfiltration                       22134
insider_malcious                   17447
Modbus_register_reading             5953
False_data_injection                5094
C&C                                 2863
Dictionary                          2572
TCP Relay                           2119
fuzzing                             1313
Reverse_shell                       1016
crypto-ransomware                    458
MitM                                 117
Fake_notification                     28
Name: count, dtype: int64

In [30]:
data['class2'].value_counts()

class2
Normal               421417
RDOS                 141261
Reconnaissance       127590
Weaponization         67260
Lateral _movement     31596
Exfiltration          22134
Tampering              5122
C&C                    2863
Exploitation           1133
crypto-ransomware       458
Name: count, dtype: int64

In [31]:
data['class3'].value_counts()

class3
Normal    421417
Attack    399417
Name: count, dtype: int64

## Timespan Computation

In [18]:
data_nonans = data.dropna()
# Timestamp is already numeric (float64, cast in cell 13), so no further conversion is needed
data_nonans = data_nonans.sort_values(by='Timestamp', ascending=False)

In [34]:
# Capture start
data_nonans['Timestamp'].iloc[-1]

1557394756.0

In [33]:
# Capture end
data_nonans['Timestamp'].iloc[0]

1583569803.0

Based on this analysis, the data in the dataset span roughly ten months (from May 2019 to March 2020). The data is also provided in a single file, so regardless of the outcome below, we consider it discontinuous. We do not apply the 1% continuity rule, as it would not be meaningful here.
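
The date range can be checked by converting the two boundary timestamps reported above. This is a minimal sketch using only the standard library, assuming the values are Unix epoch seconds interpreted as UTC.

```python
from datetime import datetime, timezone

# Boundary values taken from the capture start/end outputs above
capture_start = datetime.fromtimestamp(1557394756, tz=timezone.utc)
capture_end = datetime.fromtimestamp(1583569803, tz=timezone.utc)

print(capture_start.date())                 # 2019-05-09
print(capture_end.date())                   # 2020-03-07
print((capture_end - capture_start).days)   # 302
```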

In [26]:
# Compute the effective timespan of the capture and its contiguity
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple[pd.Timedelta, bool]:
    """Computes the total timespan of the capture, skipping gaps longer than gap_max_secs.

    Expects a Series of timestamps (in seconds) sorted in descending order.
    Returns the summed duration and whether the capture is contiguous (no gap exceeded the limit).
    """
    total_dur = 0
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the timestamps to find gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= gap_max_secs:
            total_dur += dur_gap
        else:
            # A gap longer than the threshold breaks contiguity and is not counted
            contiguous = False

        last_tstamp = cur_tstamp

    return pd.Timedelta(seconds=total_dur), contiguous

In [27]:
# Measure the effective capture duration, treating gaps longer than 300 s as interruptions
measure_real_capture_dur(data_nonans['Timestamp'], 300)

(Timedelta('6 days 07:07:40'), False)
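
The same measurement can be expressed without an explicit loop. The sketch below is one vectorized variant built on `Series.diff`, checked on made-up timestamps rather than the dataset; the function name `measure_capture_dur_vectorized` is hypothetical.

```python
import pandas as pd

def measure_capture_dur_vectorized(ts: pd.Series, gap_max_secs: float = 300):
    """Expects timestamps (in seconds) sorted in descending order."""
    # Gap between each timestamp and the previous (newer) one;
    # the first diff is NaN, so it is dropped
    gaps = -ts.diff().iloc[1:]

    # Only gaps within the threshold count towards the capture duration
    within = gaps <= gap_max_secs
    total = float(gaps[within].sum())

    # Contiguous only if no gap exceeded the threshold
    return pd.Timedelta(seconds=total), bool(within.all())

# Toy timestamps (hypothetical): one 790 s gap splits two short bursts
toy = pd.Series([1000.0, 900.0, 890.0, 100.0, 50.0])
result = measure_capture_dur_vectorized(toy)
print(result)
```

On the toy input, the gaps are 100, 10, 790, and 50 seconds, so only 160 seconds are counted and the capture is reported as non-contiguous.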