# SIMARGL2022 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

Data source: [https://www.kaggle.com/datasets/h2020simargl/simargl2022](https://www.kaggle.com/datasets/h2020simargl/simargl2022)

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pathlib
import os

pd.set_option('display.max_columns', None)

In [2]:
PATH_ROOT = '/data/disk2/simargl2021'

In [3]:
ALL_CSVS = sorted(os.listdir(PATH_ROOT))
ALL_CSVS

['dos-03-15-2022-15-44-32.csv',
 'dos-03-16-2022-13-45-18.csv',
 'dos-03-17-2022-16-22-53.csv',
 'dos-03-18-2022-19-27-05.csv',
 'dos-03-19-2022-20-01-53.csv',
 'dos-03-20-2022-14-27-54.csv',
 'malware-03-25-2022-17-57-07.csv',
 'normal-03-15-2022-15-43-44.csv',
 'normal-03-16-2022-13-44-27.csv',
 'normal-03-17-2022-16-21-30.csv',
 'normal-03-18-2022-19-17-31.csv',
 'normal-03-18-2022-19-25-48.csv',
 'normal-03-19-2022-20-01-16.csv',
 'normal-03-20-2022-14-27-30.csv',
 'portscanning-03-15-2022-15-44-06.csv',
 'portscanning-03-16-2022-13-44-50.csv',
 'portscanning-03-17-2022-16-22-53.csv',
 'portscanning-03-18-2022-19-27-05.csv',
 'portscanning-03-19-2022-20-01-45.csv',
 'portscanning-03-20-2022-14-27-49.csv']

There are a lot of files, separated by class and day. Judging by their names, the captures are very likely not continuous. The single capture from 03-25-2022 alone confirms that the data is not continuous. Let's load all the files into one DataFrame for analysis, since they are not big individually.

In [4]:
# Read every CSV once and concatenate in a single call; concatenating
# inside the loop would copy the accumulated data on each iteration.
data_all = pd.concat(
    [pd.read_csv(os.path.join(PATH_ROOT, file)) for file in ALL_CSVS]
)

## NaN Values Investigation

In [5]:
data_all.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 30148258 entries, 0 to 291301
Data columns (total 32 columns):
 #   Column                      Non-Null Count     Dtype  
---  ------                      --------------     -----  
 0   FLOW_ID                     30148258 non-null  int64  
 1   PROTOCOL_MAP                30148258 non-null  object 
 2   L4_SRC_PORT                 30148258 non-null  int64  
 3   IPV4_SRC_ADDR               30148258 non-null  object 
 4   L4_DST_PORT                 30148258 non-null  int64  
 5   IPV4_DST_ADDR               30148258 non-null  object 
 6   FIRST_SWITCHED              30148258 non-null  int64  
 7   FLOW_DURATION_MILLISECONDS  30148258 non-null  int64  
 8   LAST_SWITCHED               30148258 non-null  int64  
 9   PROTOCOL                    30148258 non-null  int64  
 10  TCP_FLAGS                   30148258 non-null  int64  
 11  TCP_WIN_MAX_IN              30148258 non-null  int64  
 12  TCP_WIN_MAX_OUT             30148258 non-null  

In [6]:
# As is apparent, the ALERT and ANOMALY fields are sometimes null. What are their values?
data_all['ANOMALY'].value_counts()

ANOMALY
1.0    3036969
0.0    2905979
Name: count, dtype: int64

In [7]:
data_all['ALERT'].value_counts()

ALERT
Denial of Service    6425161
Port Scanning        4170392
Malware                  571
Name: count, dtype: int64

In [8]:
# Is there any correlation between the two? E.g., when ANOMALY is non-NaN, is ALERT non-NaN as well?
# Count the rows where ANOMALY is set but ALERT is missing.
data_all[data_all['ANOMALY'].notna()]['ALERT'].isna().sum()

2695645

OK, so the hypothesis above does not hold: a non-NaN `ANOMALY` does not imply a non-NaN `ALERT`. Nevertheless, let's continue analyzing the data.
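One way to see how the missing values of the two columns relate is to cross-tabulate their null masks. The sketch below runs on a tiny hand-made frame whose values are illustrative only; just the column names `ALERT` and `ANOMALY` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_all; the values are made up for illustration.
toy = pd.DataFrame({
    'ALERT':   ['Denial of Service', None, 'Port Scanning', None, None],
    'ANOMALY': [np.nan, 0.0, 1.0, np.nan, 0.0],
})

# Rows: is ALERT missing? Columns: is ANOMALY missing?
null_overlap = pd.crosstab(
    toy['ALERT'].isna().rename('ALERT is NaN'),
    toy['ANOMALY'].isna().rename('ANOMALY is NaN'),
)
print(null_overlap)
```

Applied to `data_all`, the same table would show all four combinations at once instead of a single count.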

In [9]:
# Split the file names by traffic class (lists, so they can be iterated repeatedly)
DOS_FNAMES = [fname for fname in ALL_CSVS if fname.startswith('dos')]
MW_FNAMES = [fname for fname in ALL_CSVS if fname.startswith('malware')]
NORMAL_FNAMES = [fname for fname in ALL_CSVS if fname.startswith('normal')]
SCAN_FNAMES = [fname for fname in ALL_CSVS if fname.startswith('portscanning')]

In [10]:
for file in ALL_CSVS:
    # Check that pandas can open the file and whether all files have the same number of columns
    one_file_data = pd.read_csv(os.path.join(PATH_ROOT, file), nrows=10)

    # Print the number of file's columns
    print(f'{len(one_file_data.columns)} : {file}')

31 : dos-03-15-2022-15-44-32.csv
31 : dos-03-16-2022-13-45-18.csv
31 : dos-03-17-2022-16-22-53.csv
31 : dos-03-18-2022-19-27-05.csv
32 : dos-03-19-2022-20-01-53.csv
32 : dos-03-20-2022-14-27-54.csv
32 : malware-03-25-2022-17-57-07.csv
31 : normal-03-15-2022-15-43-44.csv
31 : normal-03-16-2022-13-44-27.csv
31 : normal-03-17-2022-16-21-30.csv
31 : normal-03-18-2022-19-17-31.csv
31 : normal-03-18-2022-19-25-48.csv
32 : normal-03-19-2022-20-01-16.csv
32 : normal-03-20-2022-14-27-30.csv
31 : portscanning-03-15-2022-15-44-06.csv
31 : portscanning-03-16-2022-13-44-50.csv
31 : portscanning-03-17-2022-16-22-53.csv
31 : portscanning-03-18-2022-19-27-05.csv
32 : portscanning-03-19-2022-20-01-45.csv
32 : portscanning-03-20-2022-14-27-49.csv


Apparently, the feature extractor was changed on 03-19-2022 and has produced one more feature since then.
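To pin down which column was added, the headers can be compared without loading any data rows. A minimal sketch, assuming in-memory CSV text in place of the real files: the file names, the reduced column sets, and the choice of `ANOMALY` as the extra column are hypothetical placeholders, not findings. Passing `nrows=0` makes pandas parse only the header.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two capture files with different schemas.
fake_files = {
    'old-schema.csv': 'FLOW_ID,ALERT,ANALYSIS_TIMESTAMP\n1,Denial of Service,1647344695\n',
    'new-schema.csv': 'FLOW_ID,ALERT,ANALYSIS_TIMESTAMP,ANOMALY\n2,Port Scanning,1647773781,1.0\n',
}

# Read only the header row of each file.
columns_per_file = {
    name: set(pd.read_csv(io.StringIO(text), nrows=0).columns)
    for name, text in fake_files.items()
}

# Columns shared by every file, and the extras each file adds on top.
common = set.intersection(*columns_per_file.values())
extras = {name: sorted(cols - common) for name, cols in columns_per_file.items()}
print(extras)
```

With the real files, `io.StringIO(text)` would be replaced by the path built from `PATH_ROOT` and the file name.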

## Actual Data Analysis

In [11]:
# Keep analyzing the whole dataset for now
data_all.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 30148258 entries, 0 to 291301
Data columns (total 32 columns):
 #   Column                      Non-Null Count     Dtype  
---  ------                      --------------     -----  
 0   FLOW_ID                     30148258 non-null  int64  
 1   PROTOCOL_MAP                30148258 non-null  object 
 2   L4_SRC_PORT                 30148258 non-null  int64  
 3   IPV4_SRC_ADDR               30148258 non-null  object 
 4   L4_DST_PORT                 30148258 non-null  int64  
 5   IPV4_DST_ADDR               30148258 non-null  object 
 6   FIRST_SWITCHED              30148258 non-null  int64  
 7   FLOW_DURATION_MILLISECONDS  30148258 non-null  int64  
 8   LAST_SWITCHED               30148258 non-null  int64  
 9   PROTOCOL                    30148258 non-null  int64  
 10  TCP_FLAGS                   30148258 non-null  int64  
 11  TCP_WIN_MAX_IN              30148258 non-null  int64  
 12  TCP_WIN_MAX_OUT             30148258 non-null  

In [12]:
len(data_all)

30148258

In [13]:
data_all.head()

Unnamed: 0,FLOW_ID,PROTOCOL_MAP,L4_SRC_PORT,IPV4_SRC_ADDR,L4_DST_PORT,IPV4_DST_ADDR,FIRST_SWITCHED,FLOW_DURATION_MILLISECONDS,LAST_SWITCHED,PROTOCOL,TCP_FLAGS,TCP_WIN_MAX_IN,TCP_WIN_MAX_OUT,TCP_WIN_MIN_IN,TCP_WIN_MIN_OUT,TCP_WIN_MSS_IN,TCP_WIN_SCALE_IN,TCP_WIN_SCALE_OUT,SRC_TOS,DST_TOS,TOTAL_FLOWS_EXP,MIN_IP_PKT_LEN,MAX_IP_PKT_LEN,TOTAL_PKTS_EXP,TOTAL_BYTES_EXP,IN_BYTES,IN_PKTS,OUT_BYTES,OUT_PKTS,ALERT,ANALYSIS_TIMESTAMP,ANOMALY
0,334626451,tcp,50242,10.114.241.191,443,10.114.224.73,1647344604,105,1647344604,6,31,64240,27960,64240,27960,1460,7,8,0,0,334626451,0,0,0,0,551,5,192,4,Denial of Service,1647344695,
1,334626948,tcp,50244,10.114.241.191,443,10.114.224.73,1647344609,6,1647344609,6,31,64240,27960,64240,27960,1460,7,8,0,0,334626948,0,0,0,0,551,5,192,4,Denial of Service,1647344695,
2,334627641,tcp,50246,10.114.241.191,443,10.114.224.73,1647344614,111,1647344614,6,31,64240,27960,64240,27960,1460,7,8,0,0,334627641,0,0,0,0,551,5,192,4,Denial of Service,1647344695,
3,334628195,tcp,50248,10.114.241.191,443,10.114.224.73,1647344619,6,1647344619,6,31,64240,27960,64240,27960,1460,7,8,0,0,334628195,0,0,0,0,551,5,192,4,Denial of Service,1647344695,
4,334628673,tcp,54328,10.114.241.191,443,10.114.224.73,1647344534,110000,1647344644,6,24,501,122,501,122,0,0,0,0,0,334628673,0,0,0,0,1309,12,624,12,Denial of Service,1647344695,


In [14]:
data_all.tail()

Unnamed: 0,FLOW_ID,PROTOCOL_MAP,L4_SRC_PORT,IPV4_SRC_ADDR,L4_DST_PORT,IPV4_DST_ADDR,FIRST_SWITCHED,FLOW_DURATION_MILLISECONDS,LAST_SWITCHED,PROTOCOL,TCP_FLAGS,TCP_WIN_MAX_IN,TCP_WIN_MAX_OUT,TCP_WIN_MIN_IN,TCP_WIN_MIN_OUT,TCP_WIN_MSS_IN,TCP_WIN_SCALE_IN,TCP_WIN_SCALE_OUT,SRC_TOS,DST_TOS,TOTAL_FLOWS_EXP,MIN_IP_PKT_LEN,MAX_IP_PKT_LEN,TOTAL_PKTS_EXP,TOTAL_BYTES_EXP,IN_BYTES,IN_PKTS,OUT_BYTES,OUT_PKTS,ALERT,ANALYSIS_TIMESTAMP,ANOMALY
291297,373409629,tcp,62837,10.114.241.166,44550,10.114.224.53,1647773700,0,1647773700,6,2,1024,0,1024,0,1460,0,0,0,0,373409629,0,0,0,0,44,1,0,0,Port Scanning,1647773781,1.0
291298,373409628,tcp,62836,10.114.241.166,38086,10.114.224.117,1647773700,1,1647773700,6,22,1024,0,1024,0,1460,0,0,0,0,373409628,0,0,0,0,44,1,40,1,Port Scanning,1647773781,1.0
291299,373409631,tcp,62837,10.114.241.166,48791,10.114.224.203,1647773700,0,1647773700,6,2,1024,0,1024,0,1460,0,0,0,0,373409631,0,0,0,0,44,1,0,0,Port Scanning,1647773781,1.0
291300,373409630,tcp,62836,10.114.241.166,26881,10.114.224.63,1647773700,0,1647773700,6,2,1024,0,1024,0,1460,0,0,0,0,373409630,0,0,0,0,44,1,0,0,Port Scanning,1647773781,1.0
291301,373409633,tcp,62836,10.114.241.166,47947,10.114.224.203,1647773700,0,1647773700,6,2,1024,0,1024,0,1460,0,0,0,0,373409633,0,0,0,0,44,1,0,0,Port Scanning,1647773781,1.0


In [15]:
data_all.describe()

Unnamed: 0,FLOW_ID,L4_SRC_PORT,L4_DST_PORT,FIRST_SWITCHED,FLOW_DURATION_MILLISECONDS,LAST_SWITCHED,PROTOCOL,TCP_FLAGS,TCP_WIN_MAX_IN,TCP_WIN_MAX_OUT,TCP_WIN_MIN_IN,TCP_WIN_MIN_OUT,TCP_WIN_MSS_IN,TCP_WIN_SCALE_IN,TCP_WIN_SCALE_OUT,SRC_TOS,DST_TOS,TOTAL_FLOWS_EXP,MIN_IP_PKT_LEN,MAX_IP_PKT_LEN,TOTAL_PKTS_EXP,TOTAL_BYTES_EXP,IN_BYTES,IN_PKTS,OUT_BYTES,OUT_PKTS,ANALYSIS_TIMESTAMP,ANOMALY
count,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,30148258.0,30148258.0,30148258.0,30148258.0,30148260.0,30148260.0,30148260.0,30148260.0,30148260.0,5942948.0
mean,355024800.0,44572.22,8029.699,1647537000.0,5127.937,1647537000.0,9.269864,9.235459,21494.39,6220.592,21437.77,6202.765,690.8209,2.300147,0.7769335,7.831136,8.686281,355024800.0,0.0,0.0,0.0,0.0,12097.68,25.44525,40763.11,34.66279,1647537000.0,0.5110206
std,13572250.0,16987.17,16778.61,120468.7,18954.97,120469.0,5.424333,15.16865,29744.45,18186.9,29739.78,18163.75,729.8235,3.358245,2.42899,32.72519,34.84271,13572250.0,0.0,0.0,0.0,0.0,1302045.0,1179.611,2669808.0,1941.03,120376.7,0.4998786
min,3.0,0.0,0.0,1647330000.0,0.0,1647330000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,28.0,1.0,0.0,0.0,1647330000.0,0.0
25%,351433400.0,37928.0,53.0,1647440000.0,0.0,1647440000.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,351433400.0,0.0,0.0,0.0,0.0,60.0,1.0,0.0,0.0,1647440000.0,0.0
50%,357089600.0,49323.0,443.0,1647518000.0,32.0,1647518000.0,6.0,2.0,1024.0,0.0,1024.0,0.0,0.0,0.0,0.0,0.0,0.0,357089600.0,0.0,0.0,0.0,0.0,120.0,2.0,52.0,1.0,1647518000.0,1.0
75%,363896100.0,55876.0,2375.0,1647605000.0,3017.0,1647605000.0,17.0,22.0,64240.0,0.0,64240.0,0.0,1460.0,7.0,0.0,0.0,0.0,363896100.0,0.0,0.0,0.0,0.0,240.0,4.0,266.0,2.0,1647605000.0,1.0
max,374376300.0,65535.0,65535.0,1648226000.0,119999.0,1648226000.0,58.0,255.0,65535.0,65535.0,65535.0,65535.0,58165.0,14.0,14.0,255.0,252.0,374376300.0,0.0,0.0,0.0,0.0,1670406000.0,1114854.0,3547208000.0,3004436.0,1648226000.0,1.0


In [16]:
# Analyze alerts distribution
data_all['ALERT'].value_counts()

ALERT
Denial of Service    6425161
Port Scanning        4170392
Malware                  571
Name: count, dtype: int64

In [17]:
# Analyze anomaly distribution
data_all['ANOMALY'].value_counts()

ANOMALY
1.0    3036969
0.0    2905979
Name: count, dtype: int64

In [18]:
# Protocols
data_all['PROTOCOL'].value_counts()

PROTOCOL
6     19074545
17     9604576
1      1463362
47        4800
58         688
57         284
41           3
Name: count, dtype: int64

In [19]:
# How big is the network? Grasp the number of unique IP addresses
# and their communication volumes
ip_activity = data_all['IPV4_SRC_ADDR'].value_counts()

ip_activity[ip_activity > 100]

IPV4_SRC_ADDR
10.114.241.191    6425161
10.114.241.166    4170392
10.114.225.205    2698888
10.114.225.206    2680707
10.114.226.5      2427748
                   ...   
159.65.51.254         101
122.51.149.86         101
192.241.157.63        101
81.91.233.42          101
213.136.79.205        101
Name: count, Length: 5035, dtype: int64

OK, there are 115k unique IP addresses, 5k of which sent over 100 flows. 10.114.241.0/24 seems to be an internal network, given the amount of traffic. Let's single out the addresses that communicate above some threshold.

In [20]:
# How big is the internal network? Count source addresses by private-range prefix.
# Note: '172.16.' matches only one /16 of the private 172.16.0.0/12 block.
print(len([key for key in ip_activity.keys() if key.startswith('10.')]))
print(len([key for key in ip_activity.keys() if key.startswith('172.16.')]))
print(len([key for key in ip_activity.keys() if key.startswith('192.168.')]))

265
458
105


Based on these findings, we can conclude that the network is large (500+ hosts), although NAT may play a big role as well. For the purposes of this analysis, the network is classified as large due to the number of IP addresses present.
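String prefixes only approximate the private ranges. A stricter check can be sketched with the standard-library `ipaddress` module. The helper name `private_block` is hypothetical, and the sample list mixes two addresses from the output above with two made-up ones (`172.20.1.5`, `192.168.0.7`).

```python
import ipaddress
from collections import Counter

# RFC 1918 private IPv4 blocks.
PRIVATE_BLOCKS = [
    ipaddress.ip_network(net)
    for net in ('10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16')
]

def private_block(addr: str):
    """Return the RFC 1918 block containing addr, or None for any other address."""
    ip = ipaddress.ip_address(addr)
    for block in PRIVATE_BLOCKS:
        if ip in block:
            return str(block)
    return None

sample = ['10.114.241.191', '172.20.1.5', '192.168.0.7', '159.65.51.254']
counts = Counter(private_block(addr) for addr in sample)
print(counts)
```

On the real data, the same helper could be mapped over `ip_activity.index` to count the hosts per private block.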

## Temporal and Continuity Analysis

Consider `LAST_SWITCHED` as the timestamp of the flow, disregarding its potential duration for this analysis.
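As a quick sanity check of how the three time-related columns relate, the sketch below compares `LAST_SWITCHED - FIRST_SWITCHED` with the duration rounded down to whole seconds. It uses three rows copied from the `data_all.head()` output above; whether the relation holds for the whole dataset is an assumption that is not verified here.

```python
import pandas as pd

# Three rows copied from the data_all.head() output (indices 0, 1 and 4).
sample = pd.DataFrame({
    'FIRST_SWITCHED': [1647344604, 1647344609, 1647344534],
    'FLOW_DURATION_MILLISECONDS': [105, 6, 110000],
    'LAST_SWITCHED': [1647344604, 1647344609, 1647344644],
})

# Difference of the second-resolution timestamps vs. the duration in whole seconds.
delta_secs = sample['LAST_SWITCHED'] - sample['FIRST_SWITCHED']
consistent = delta_secs == sample['FLOW_DURATION_MILLISECONDS'] // 1000
print(consistent.all())
```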

In [21]:
data_all['LAST_SWITCHED'].head()

0    1647344604
1    1647344609
2    1647344614
3    1647344619
4    1647344644
Name: LAST_SWITCHED, dtype: int64

In [22]:
# These look like epoch timestamps with one-second precision; convert them to datetime.
timestamps = pd.to_datetime(data_all['LAST_SWITCHED'], origin='unix', unit='s').sort_values(ascending=False)
timestamps

570     2022-03-25 16:30:17
567     2022-03-25 16:25:17
137     2022-03-25 16:25:17
566     2022-03-25 16:20:17
559     2022-03-25 16:15:17
                ...        
24603   2022-03-15 07:36:57
24602   2022-03-15 07:36:57
24601   2022-03-15 07:36:57
24600   2022-03-15 07:36:57
21869   2022-03-15 07:36:57
Name: LAST_SWITCHED, Length: 30148258, dtype: datetime64[ns]

In [23]:
# Compute the overall span
total_span = timestamps.iloc[0] - timestamps.iloc[-1]
total_span

Timedelta('10 days 08:53:20')

In [24]:
# Maximum gap for the capture to still count as continuous (1% of the total span) - already quite long
total_span * 0.01

Timedelta('0 days 02:29:20')

In [25]:
# Measure capture duration and continuity
def measure_real_capture_dur(
    data: pd.Series, gap_max_secs: int = 300
) -> tuple[pd.Timedelta, bool, list[pd.Timedelta]]:
    """Measure the real capture duration, skipping gaps longer than gap_max_secs.

    Expects a Series of timestamps sorted in descending order. Returns the total
    duration without the gaps, a flag telling whether the capture is contiguous,
    and the durations of the continuous blocks, longest first.
    """
    gap_max = pd.Timedelta(seconds=gap_max_secs)
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the timestamps to find the gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= gap_max:
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # Gap too long: close the current block and start a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    # Log the final continuous block and sort the blocks by their length
    cont_durations.append(current_dur)
    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations

In [26]:
# We suspect that the capture will be discontinuous due to several files, let's see the analysis.
measure_real_capture_dur(timestamps, int(total_span.total_seconds() * 0.01))

(Timedelta('5 days 01:36:31'),
 False,
 [Timedelta('2 days 08:52:32'),
  Timedelta('0 days 23:27:01'),
  Timedelta('0 days 11:31:56'),
  Timedelta('0 days 09:52:05'),
  Timedelta('0 days 08:30:12'),
  Timedelta('0 days 04:06:09'),
  Timedelta('0 days 03:19:06'),
  Timedelta('0 days 02:56:52'),
  Timedelta('0 days 00:43:05'),
  Timedelta('0 days 00:17:33')])

Since there are fewer continuous blocks (10) than files (20), the gaps between some of the captures must be shorter than the threshold of roughly 2.5 hours.
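The Python-level loop in `measure_real_capture_dur` is slow on 30 million rows. The same block detection can be sketched in vectorized form with `Series.diff()`. The function name `continuous_blocks` and the toy timestamps are illustrative assumptions; unlike the original, this variant sorts the timestamps in ascending order itself.

```python
import pandas as pd

def continuous_blocks(timestamps: pd.Series, gap_max: pd.Timedelta) -> pd.Series:
    """Return the duration of each gap-free block, longest first."""
    ts = timestamps.sort_values().reset_index(drop=True)
    # A new block starts wherever the step from the previous timestamp exceeds gap_max.
    block_id = (ts.diff() > gap_max).cumsum()
    grouped = ts.groupby(block_id)
    durations = grouped.max() - grouped.min()
    return durations.sort_values(ascending=False).reset_index(drop=True)

# Toy timestamps: two blocks separated by a gap of 880 seconds.
toy = pd.Series(pd.to_datetime([0, 60, 120, 1000, 1030], unit='s'))
blocks = continuous_blocks(toy, pd.Timedelta(seconds=300))
print(blocks)
```

Here `blocks.sum()` corresponds to the total duration without gaps, and `len(blocks) == 1` to the contiguity flag returned by the loop-based function.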

In [27]:
# Analyze if individual captures are continuous. Set maximum gap as 1% of 10 hours == 360 seconds
for file in sorted(ALL_CSVS):
    filepath = os.path.join(PATH_ROOT, file)
    data_partial = pd.read_csv(filepath)

    timestamps_partial = pd.to_datetime(data_partial['LAST_SWITCHED'], unit='s', origin='unix')
    timestamps_partial = timestamps_partial.sort_values(ascending=False)

    print(f"{file} : {measure_real_capture_dur(timestamps_partial, 360)}")

dos-03-15-2022-15-44-32.csv : (Timedelta('0 days 00:16:38'), True, [Timedelta('0 days 00:16:38')])
dos-03-16-2022-13-45-18.csv : (Timedelta('0 days 00:04:36'), True, [Timedelta('0 days 00:04:36')])
dos-03-17-2022-16-22-53.csv : (Timedelta('0 days 01:59:54'), False, [Timedelta('0 days 00:59:57'), Timedelta('0 days 00:59:57')])
dos-03-18-2022-19-27-05.csv : (Timedelta('0 days 03:59:47'), False, [Timedelta('0 days 00:59:57'), Timedelta('0 days 00:59:57'), Timedelta('0 days 00:59:57'), Timedelta('0 days 00:59:56')])
dos-03-19-2022-20-01-53.csv : (Timedelta('0 days 01:59:53'), False, [Timedelta('0 days 00:59:57'), Timedelta('0 days 00:59:56')])
dos-03-20-2022-14-27-54.csv : (Timedelta('0 days 01:50:58'), False, [Timedelta('0 days 00:59:56'), Timedelta('0 days 00:51:02')])
malware-03-25-2022-17-57-07.csv : (Timedelta('0 days 17:30:33'), False, [Timedelta('0 days 02:15:09'), Timedelta('0 days 01:55:55'), Timedelta('0 days 00:55:00'), Timedelta('0 days 00:24:53'), Timedelta('0 days 00:22:07'),

Therefore, we consider the dataset discontinuous.