# ROSIDS23 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
PATH_DATASET = '/data/surv/rosids23/ROSIDS23.csv'
PATH_PCAPS = '/data/surv/rosids23/pcaps'
PATH_PROCESSED = '/data/surv/rosids23/processed'

# PCAP Analysis

In [3]:
# Analysis of one single file
!capinfos $PATH_PCAPS/noattack.pcap

File name:           /data/surv/rosids23/pcaps//noattack.pcap
File type:           Wireshark/tcpdump/... - pcap
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: 96 bytes
Packet size limit:   inferred: 96 bytes
Number of packets:   38 M
File size:           3,466 MB
Data size:           5,010 MB
Capture duration:    14410.630539 seconds
First packet time:   2023-07-07 13:10:03.864049
Last packet time:    2023-07-07 17:10:14.494588
Data byte rate:      347 kBps
Data bit rate:       2,781 kbps
Average packet size: 129.18 bytes
Average packet rate: 2,691 packets/s
SHA256:              e4399eb21ee7d258df946301c21d0a7b81614312c34f872bc42ac3f29bf8b591
RIPEMD160:           406bcc88def22464391efa97dc15cfa624515c76
SHA1:                3a839688fe0046f21bfbde7ea00d9f0205de7042
Strict time order:   True
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - ether)
                     Capture len

In [4]:
for file in os.listdir(PATH_PCAPS):
    fullpath = os.path.join(PATH_PCAPS, file)
    !echo $file
    !capinfos -a -e $fullpath

unauthorizedpublisher.pcap
File name:           /data/surv/rosids23/pcaps/unauthorizedpublisher.pcap
Packet size limit:   inferred: 96 bytes
First packet time:   2023-07-12 09:28:19.537701
Last packet time:    2023-07-12 10:49:26.144160
unauthorizedsubsriber.pcap
File name:           /data/surv/rosids23/pcaps/unauthorizedsubsriber.pcap
Packet size limit:   inferred: 96 bytes
First packet time:   2023-07-12 11:26:26.789501
Last packet time:    2023-07-12 12:48:03.420217
subcriberflood.pcap
File name:           /data/surv/rosids23/pcaps/subcriberflood.pcap
Packet size limit:   inferred: 96 bytes
First packet time:   2023-08-15 13:53:26.670436
Last packet time:    2023-08-15 15:28:33.980567
DoS.pcap
File name:           /data/surv/rosids23/pcaps/DoS.pcap
Packet size limit:   inferred: 96 bytes
First packet time:   2023-07-12 15:46:26.101907
Last packet time:    2023-07-12 17:35:28.707275
noattack.pcap
File name:           /data/surv/rosids23/pcaps/noattack.pcap
Packet size limit:   inferr

The captures evidently do not overlap, so the sum of their individual durations gives the expected total capture duration.
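The non-overlap claim can be checked directly from the first and last packet times printed by `capinfos`. The following is a minimal sketch, assuming the timestamps are copied by hand from the outputs above; the `CAPTURES` dictionary and both helper functions are illustrative names and not part of the original notebook.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# First/last packet times copied manually from the capinfos outputs
CAPTURES = {
    "noattack.pcap": ("2023-07-07 13:10:03.864049", "2023-07-07 17:10:14.494588"),
    "unauthorizedpublisher.pcap": ("2023-07-12 09:28:19.537701", "2023-07-12 10:49:26.144160"),
    "unauthorizedsubsriber.pcap": ("2023-07-12 11:26:26.789501", "2023-07-12 12:48:03.420217"),
    "DoS.pcap": ("2023-07-12 15:46:26.101907", "2023-07-12 17:35:28.707275"),
    "subcriberflood.pcap": ("2023-08-15 13:53:26.670436", "2023-08-15 15:28:33.980567"),
}


def to_intervals(captures):
    """Parse (start, end) strings into datetime pairs sorted by start time."""
    return sorted(
        (datetime.strptime(start, FMT), datetime.strptime(end, FMT))
        for start, end in captures.values()
    )


def captures_overlap(intervals):
    """Return True if any capture starts before the previous one has ended."""
    return any(nxt[0] < prev[1] for prev, nxt in zip(intervals, intervals[1:]))


intervals = to_intervals(CAPTURES)
total = sum((end - start for start, end in intervals), timedelta())

print("Overlapping:", captures_overlap(intervals))
print("Total duration [s]:", total.total_seconds())
```

With these values no two captures overlap, and the summed duration is 36423.783213 s, the same value the duration script prints further below.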

In [5]:
!/data/AAAA_SCRIPTS/countpkts.sh $PATH_PCAPS

Processing /data/surv/rosids23/pcaps//DoS.pcap
Processing /data/surv/rosids23/pcaps//noattack.pcap
Processing /data/surv/rosids23/pcaps//subcriberflood.pcap
Processing /data/surv/rosids23/pcaps//unauthorizedpublisher.pcap
Processing /data/surv/rosids23/pcaps//unauthorizedsubsriber.pcap

TOTAL FILES : 5
TOTAL PACKETS: 176318001


In [6]:
!/data/AAAA_SCRIPTS/countdur.sh $PATH_PCAPS

36423.783213


In [7]:
import math

display(math.floor(36423 / 3600))
display((36423 - (10 * 3600)) / 60)

10

7.05

Based on the analyzed PCAPs, the total capture duration is roughly 10 h 7 min.
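The same conversion can be written without hard-coding the number of hours. This is a small sketch using `divmod`; the `split_duration` helper is a hypothetical name introduced here for illustration.

```python
def split_duration(total_seconds: float) -> tuple[int, int, int]:
    """Split a duration in seconds into whole hours, minutes, and seconds."""
    # Drop the fractional part, then peel off minutes and hours in turn
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds


# Total duration in seconds as reported for the PCAP directory
hours, minutes, seconds = split_duration(36423.783213)
print(f"{hours}h {minutes}m {seconds}s")
```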

# CSV Analysis

In [8]:
data = pd.read_csv(PATH_DATASET)
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136681 entries, 0 to 136680
Data columns (total 84 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Flow ID            136681 non-null  object 
 1   Src IP             136681 non-null  object 
 2   Src Port           136681 non-null  int64  
 3   Dst IP             136681 non-null  object 
 4   Dst Port           136681 non-null  int64  
 5   Protocol           136681 non-null  int64  
 6   Timestamp          136681 non-null  object 
 7   Flow Duration      136681 non-null  int64  
 8   Tot Fwd Pkts       136681 non-null  int64  
 9   Tot Bwd Pkts       136681 non-null  int64  
 10  TotLen Fwd Pkts    136681 non-null  float64
 11  TotLen Bwd Pkts    136681 non-null  float64
 12  Fwd Pkt Len Max    136681 non-null  float64
 13  Fwd Pkt Len Min    136681 non-null  float64
 14  Fwd Pkt Len Mean   136681 non-null  float64
 15  Fwd Pkt Len Std    136681 non-null  float64
 16  Bw

In [9]:
len(data)

136681

In [10]:
data.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,192.168.3.4-192.168.3.6-11311-60792-6,192.168.3.6,60792,192.168.3.4,11311,6,07/07/2023 02:10:23 PM,6260,5,5,60.0,30.0,30.0,0.0,12.0,16.431677,30.0,0.0,6.0,13.416408,14376.996805,1597.444089,695.555556,762.894342,2086.0,0.0,4068.0,1017.0,1248.414728,2807.0,2.0,4998.0,1249.5,1027.437103,2191.0,354.0,0,0,0,0,160,176,798.722045,798.722045,0.0,30.0,8.181818,14.012981,196.363636,0,1,0,0,0,0,0,0,1.0,9.0,12.0,6.0,0,0,0,0,0,0,5,60,5,30,-1,507,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1,192.168.3.4-192.168.3.6-11311-60794-6,192.168.3.6,60794,192.168.3.4,11311,6,07/07/2023 02:10:23 PM,5903,5,5,60.0,30.0,30.0,0.0,12.0,16.431677,30.0,0.0,6.0,13.416408,15246.484838,1694.053871,655.888889,692.261592,2117.0,1.0,3602.0,900.5,1313.40537,2853.0,2.0,5146.0,1286.5,761.237808,2117.0,358.0,0,0,0,0,160,176,847.026935,847.026935,0.0,30.0,8.181818,14.012981,196.363636,0,1,0,0,0,0,0,0,1.0,9.0,12.0,6.0,0,0,0,0,0,0,5,60,5,30,-1,507,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
2,192.168.3.4-192.168.3.6-11311-39922-6,192.168.3.6,39922,192.168.3.4,11311,6,07/07/2023 02:10:32 PM,4523,5,5,60.0,30.0,30.0,0.0,12.0,16.431677,30.0,0.0,6.0,13.416408,19898.29759,2210.921954,502.555556,765.772994,2344.0,1.0,3516.0,879.0,1237.081242,2712.0,1.0,3710.0,927.5,1141.79464,2344.0,1.0,0,0,0,0,160,176,1105.460977,1105.460977,0.0,30.0,8.181818,14.012981,196.363636,0,1,0,0,0,0,0,0,1.0,9.0,12.0,6.0,0,0,0,0,0,0,5,60,5,30,-1,507,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
3,192.168.3.4-192.168.3.6-11311-55266-6,192.168.3.6,55266,192.168.3.4,11311,6,07/07/2023 02:11:11 PM,5191,5,5,60.0,30.0,30.0,0.0,12.0,16.431677,30.0,0.0,6.0,13.416408,17337.699865,1926.411096,576.777778,787.232459,2294.0,1.0,3905.0,976.25,1424.821714,3021.0,1.0,3946.0,986.5,1166.99343,2294.0,1.0,0,0,0,0,160,176,963.205548,963.205548,0.0,30.0,8.181818,14.012981,196.363636,0,1,0,0,0,0,0,0,1.0,9.0,12.0,6.0,0,0,0,0,0,0,5,60,5,30,-1,507,2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
4,192.168.3.6-192.168.3.7-43770-11111-6,192.168.3.7,11111,192.168.3.6,43770,6,07/07/2023 02:10:03 PM,72625778,2200,2212,41320.0,18526.0,30.0,0.0,18.781818,11.119739,30.0,0.0,8.375226,2.73647,824.032481,60.74978,16464.696894,12108.687938,45440.0,0.0,72609410.0,33019.286039,18505.540771,78260.0,367.0,72625778.0,32847.479873,14518.434559,59288.0,0.0,0,1,0,0,70400,70784,30.292274,30.457505,0.0,30.0,13.568094,9.618526,92.516044,0,0,0,1,1,0,0,0,1.0,13.57117,18.781818,8.375226,0,0,0,0,0,0,2200,41320,2212,18526,-1,502,2176,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign


In [11]:
# Note: the timestamps come in two different formats (12-hour with AM/PM and ISO-like)
data['Timestamp']

0         07/07/2023 02:10:23 PM
1         07/07/2023 02:10:23 PM
2         07/07/2023 02:10:32 PM
3         07/07/2023 02:11:11 PM
4         07/07/2023 02:10:03 PM
                   ...          
136676       2023-07-12 13:22:46
136677       2023-07-12 12:35:44
136678       2023-07-12 13:19:44
136679       2023-07-12 12:52:31
136680       2023-07-12 13:35:21
Name: Timestamp, Length: 136681, dtype: object

In [12]:
# Label distribution
data['Label'].value_counts()

Label
Benign       62511
DoS          31000
Subflood     30064
UnauthPub     7817
UnauthSub     5289
Name: count, dtype: int64
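To put the label counts into perspective, the relative class shares can be derived from them. The sketch below copies the counts from the output above into a standalone Series so that it runs on its own; on the loaded frame, `data['Label'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
label_counts = pd.Series(
    {
        "Benign": 62511,
        "DoS": 31000,
        "Subflood": 30064,
        "UnauthPub": 7817,
        "UnauthSub": 5289,
    },
    name="count",
)

total_flows = int(label_counts.sum())

# Share of each class in percent, rounded for readability
shares_pct = (label_counts / total_flows * 100).round(2)

# Combined share of all non-benign (attack) flows
attack_share_pct = round(100 * (total_flows - label_counts["Benign"]) / total_flows, 2)

print(shares_pct)
print("Attack flows [%]:", attack_share_pct)
```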

In [13]:
# Show the number of active endpoints within the topology
data['Src IP'].value_counts()

Src IP
192.168.3.10    69421
192.168.3.4     39267
192.168.3.6     26690
192.168.3.7       999
8.6.0.1           293
192.168.3.14       11
Name: count, dtype: int64
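The count above only considers flow sources. An endpoint that appears solely as a destination would be missed, so a fuller view could combine both columns. The following is a sketch on a toy frame; the `flows` data and the `endpoint_activity` helper are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy flows standing in for the real `data` frame (values are illustrative only)
flows = pd.DataFrame(
    {
        "Src IP": ["192.168.3.6", "192.168.3.6", "192.168.3.7"],
        "Dst IP": ["192.168.3.4", "192.168.3.4", "192.168.3.6"],
    }
)


def endpoint_activity(df: pd.DataFrame) -> pd.DataFrame:
    """Count how often each address appears as a flow source and as a destination."""
    counts = pd.DataFrame(
        {
            "as_src": df["Src IP"].value_counts(),
            "as_dst": df["Dst IP"].value_counts(),
        }
    )
    # Addresses missing in one role get NaN after alignment; treat them as zero
    return counts.fillna(0).astype(int)


activity = endpoint_activity(flows)
print(activity)
print("Distinct endpoints:", len(activity))
```

On the real data, the same helper could be called as `endpoint_activity(data)`, since the frame uses the same `Src IP` and `Dst IP` column names.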

### Time Duration Analysis

In [14]:
# Since there are two different formats, fall back to the default (format-inferring) conversion, which emits a warning
tstamps = pd.to_datetime(data['Timestamp'])#, format='%d/%m/%Y %I:%M:%S %p')
tstamps = tstamps.sort_values(ascending=False).reset_index(drop=True)
tstamps

  tstamps = pd.to_datetime(data['Timestamp'])#, format='%d/%m/%Y %I:%M:%S %p')


0        2023-08-15 16:28:32
1        2023-08-15 16:28:32
2        2023-08-15 16:28:32
3        2023-08-15 16:28:32
4        2023-08-15 16:28:32
                 ...        
136676   2023-07-07 14:10:03
136677   2023-07-07 14:10:03
136678   2023-07-07 14:10:03
136679   2023-07-07 14:10:03
136680   2023-07-07 14:10:03
Name: Timestamp, Length: 136681, dtype: datetime64[ns]
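An alternative to relying on format inference is to parse each format explicitly and merge the results. The sketch below assumes the 12-hour timestamps are day-first, as the commented-out format string in the cell above suggests; the rows visible here (`07/07/2023`) cannot confirm that. The ISO-like format string and the `parse_mixed_timestamps` helper are likewise assumptions made for illustration.

```python
import pandas as pd

# Assumed formats: the first comes from the commented-out format string above,
# the second is inferred from the ISO-like rows in the Timestamp column
FMT_12H = "%d/%m/%Y %I:%M:%S %p"
FMT_ISO = "%Y-%m-%d %H:%M:%S"


def parse_mixed_timestamps(raw: pd.Series) -> pd.Series:
    """Parse a column that mixes two timestamp formats into datetime64 values."""
    # Rows that do not match a given format become NaT instead of raising
    as_12h = pd.to_datetime(raw, format=FMT_12H, errors="coerce")
    as_iso = pd.to_datetime(raw, format=FMT_ISO, errors="coerce")
    # Fill the gaps left by the first format with the second one
    return as_12h.fillna(as_iso)


sample = pd.Series(["07/07/2023 02:10:23 PM", "2023-07-12 13:22:46"])
parsed = parse_mixed_timestamps(sample)
print(parsed)
print("Unparsed rows:", int(parsed.isna().sum()))
```

Checking the number of remaining `NaT` values afterwards shows whether a third, unexpected format is present in the column.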

In [15]:
def measure_real_capture_dur(data: pd.Series, gap_minutes: int = 5) -> tuple[pd.Timedelta, bool]:
    """Compute the total capture timespan, skipping gaps of `gap_minutes` or longer.

    Expects a Series of timestamps sorted in descending order. Returns the
    accumulated duration and a flag telling whether the capture had no such gap.
    """
    gap_max = pd.Timedelta(minutes=gap_minutes)
    total_dur = pd.Timedelta(seconds=0)
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the timestamps, newest first, to find gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap < gap_max:
            # Consecutive flows are close enough: count the time between them
            total_dur += dur_gap
        else:
            # A gap of gap_max or longer means the capture is not contiguous
            contiguous = False

        last_tstamp = cur_tstamp

    return total_dur, contiguous

In [16]:
# A 6-minute gap roughly corresponds to 1% of the 10-hour capture span.
measure_real_capture_dur(tstamps, 6)

(Timedelta('0 days 10:06:58'), False)

The resulting duration (10 h 6 min 58 s) and the non-contiguous capture roughly correspond to the PCAP analysis (10 h 7 min).
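The gap-aware duration computed by the loop-based function above can also be expressed with vectorized pandas operations. The following is a sketch of an equivalent approach on a toy series; `capture_duration_vectorized` is a hypothetical name, and unlike the original it sorts the timestamps itself (in ascending order) instead of expecting pre-sorted input.

```python
import pandas as pd


def capture_duration_vectorized(tstamps: pd.Series, gap_minutes: int = 5):
    """Sum the time between consecutive timestamps, ignoring gaps of gap_minutes or more."""
    ordered = tstamps.sort_values().reset_index(drop=True)
    # Differences between neighbouring timestamps; the first entry is NaT and is dropped
    deltas = ordered.diff().dropna()
    within = deltas < pd.Timedelta(minutes=gap_minutes)
    total_dur = deltas[within].sum()
    # The capture counts as contiguous only if no difference reached the gap limit
    return total_dur, bool(within.all())


# Toy timestamps with one 58-minute gap between two active periods
toy = pd.Series(
    pd.to_datetime(
        [
            "2023-07-07 10:00:00",
            "2023-07-07 10:01:00",
            "2023-07-07 10:02:00",
            "2023-07-07 11:00:00",
            "2023-07-07 11:03:00",
        ]
    )
)

total_dur, contiguous = capture_duration_vectorized(toy, gap_minutes=5)
print(total_dur, contiguous)
```

In the toy series only the 1-, 1-, and 3-minute differences are counted, so the 58-minute pause is excluded from the total and the capture is flagged as non-contiguous.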