# TII-SSRC-23 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
BASEDIR  = '/data/surv/tii_src_23'
CSV_DIR  = os.path.join(BASEDIR, 'csv')
PCAP_DIR = os.path.join(BASEDIR, 'pcap')

## PCAP Analysis

In [3]:
# Analyze one sample PCAP
!capinfos "$PCAP_DIR/malicious/dos/rst_tcp_dos.pcap"

File name:           /data/surv/tii_src_23/pcap//malicious/dos/rst_tcp_dos.pcap
File type:           Wireshark/tcpdump/... - pcap
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: 262144 bytes
Number of packets:   3,835 k
File size:           2,970 MB
Data size:           2,909 MB
Capture duration:    177897.058266 seconds
First packet time:   1970-01-01 01:51:32.722879
Last packet time:    1970-01-03 03:16:29.781145
Data byte rate:      16 kBps
Data bit rate:       130 kbps
Average packet size: 758.36 bytes
Average packet rate: 21 packets/s
SHA256:              1639e6ae280abc8d33f791c5a3862cfe4689ee26759ad81d1fd1dfa36ff11c0b
RIPEMD160:           62c2e3f6debb3365d462c4d2121f56f56e3be79c
SHA1:                1fb1ef7cb4e1294ba2a9bcfe0686986401f755d9
Strict time order:   True
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - ether)
                     Capture length = 262144
       

As the output shows, the timestamps have been removed from the capture: the packet times fall in 1970, the start of the Unix epoch. The same pattern appears in all PCAPs, as well as in the extracted CICFlowMeter flow files.

The documentation does not say so, but let us assume that the captures do not overlap. In that case, the total duration is simply the sum of the individual capture durations, which a script can compute.
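The scripts used in the following cells are not shown in this notebook. As a rough illustration of the summation only, the sketch below extracts the `Capture duration` line from the default `capinfos` report (in the format printed above) and adds the values up. The function names `capture_duration` and `total_duration` are hypothetical and are not taken from the original scripts.

```python
import re

# Matches the "Capture duration" line of the default capinfos report,
# e.g. "Capture duration:    177897.058266 seconds".
_DURATION_RE = re.compile(r"^Capture duration:\s+([0-9.]+)\s+seconds", re.MULTILINE)


def capture_duration(report: str) -> float:
    """Return the capture duration in seconds from one capinfos report."""
    match = _DURATION_RE.search(report)
    if match is None:
        raise ValueError("no 'Capture duration' line found")
    return float(match.group(1))


def total_duration(reports) -> float:
    """Sum the durations of many reports, assuming the captures do not overlap."""
    return sum(capture_duration(report) for report in reports)


# Shortened excerpt of the report shown above (hypothetical input string)
sample = """File name:           rst_tcp_dos.pcap
Number of packets:   3,835 k
Capture duration:    177897.058266 seconds
Data byte rate:      16 kBps
"""
print(capture_duration(sample))
```

The sum is only meaningful under the non-overlap assumption stated above; overlapping captures would be counted twice.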

In [4]:
# Count the number of packets
!/data/AAAA_SCRIPTS/countpkts.sh $PCAP_DIR

Processing /data/surv/tii_src_23/pcap/benign/audio/audio.pcap
Processing /data/surv/tii_src_23/pcap/benign/background/background.pcap
Processing /data/surv/tii_src_23/pcap/benign/text/text.pcap
Processing /data/surv/tii_src_23/pcap/benign/video/http.pcap
Processing /data/surv/tii_src_23/pcap/benign/video/rtp.pcap
Processing /data/surv/tii_src_23/pcap/benign/video/udp.pcap
Processing /data/surv/tii_src_23/pcap/malicious/bruteforce/bruteforce_dns.pcap
Processing /data/surv/tii_src_23/pcap/malicious/bruteforce/bruteforce_ftp.pcap
Processing /data/surv/tii_src_23/pcap/malicious/bruteforce/bruteforce_http.pcap
Processing /data/surv/tii_src_23/pcap/malicious/bruteforce/bruteforce_ssh.pcap
Processing /data/surv/tii_src_23/pcap/malicious/bruteforce/bruteforce_telnet.pcap
Processing /data/surv/tii_src_23/pcap/malicious/dos/ack_tcp_dos.pcap
Processing /data/surv/tii_src_23/pcap/malicious/dos/cwr_tcp_dos.pcap
Processing /data/surv/tii_src_23/pcap/malicious/dos/ecn_tcp_dos.pcap
Processing /data/su

In [5]:
# Taken from:
# https://stackoverflow.com/questions/4048651/function-to-convert-seconds-into-minutes-hours-and-days

def normalize_seconds(seconds: float) -> tuple:
    """Split a duration in seconds into (days, hours, minutes, seconds)."""
    # Peel off whole days, then hours, then minutes; the rest stays as seconds
    (days, remainder) = divmod(seconds, 86400)
    (hours, remainder) = divmod(remainder, 3600)
    (minutes, seconds) = divmod(remainder, 60)

    return int(days), int(hours), int(minutes), seconds


In [6]:
# Sum all durations within PCAP files
total_dur_secs = !/data/AAAA_SCRIPTS/countdur.sh $PCAP_DIR
total_dur_secs

2721988.812781


In [7]:
normalize_seconds(total_dur_secs)

(31, 12, 6, 28.812781000044197)

Based on the PCAP analysis and the assumption of non-overlapping PCAPs, the capture lasts 31 days and 12 hours. However, the CSV analysis should give a better estimate, since the flow records include the `Timestamp` feature.

## CSV Analysis

In [8]:
DATAPATH = os.path.join(CSV_DIR, 'data.csv')
DATAPATH

'/data/surv/tii_src_23/csv/data.csv'

In [9]:
data = pd.read_csv(DATAPATH)

In [10]:
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8656767 entries, 0 to 8656766
Data columns (total 86 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   Flow ID                     8656767 non-null  object 
 1   Src IP                      8656767 non-null  object 
 2   Src Port                    8656767 non-null  float64
 3   Dst IP                      8656767 non-null  object 
 4   Dst Port                    8656767 non-null  int64  
 5   Protocol                    8656767 non-null  float64
 6   Timestamp                   8656767 non-null  object 
 7   Flow Duration               8656767 non-null  float64
 8   Total Fwd Packet            8656767 non-null  float64
 9   Total Bwd packets           8656767 non-null  float64
 10  Total Length of Fwd Packet  8656767 non-null  float64
 11  Total Length of Bwd Packet  8656767 non-null  float64
 12  Fwd Packet Length Max       8656767 non-null  float64
 1

In [11]:
data.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWR Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Traffic Type,Traffic Subtype
0,192.168.1.90-192.168.1.3-53930-64738-6,192.168.1.90,53930.0,192.168.1.3,64738,6.0,01/01/1970 07:41:46 AM,52601173.0,1701.0,1793.0,149935.0,140566.0,1318.0,0.0,88.145209,99.784742,1348.0,0.0,78.3971,87.189453,5522.709541,66.424374,15059.02,97621.75,3252689.0,-73.0,52600247.0,30941.32,139416.2,3257759.0,21.0,52600977.0,29353.22,136478.3,3256370.0,123.0,0.0,0.0,0.0,0.0,54464.0,57384.0,32.337682,34.086692,0.0,1348.0,83.119027,93.64393,8769.1856,2.0,2.0,0.0,2416.0,3493.0,0.0,0.0,0.0,1.0,83.142816,88.145209,78.3971,0.0,0.0,0.0,72603.0,603.0,7187.0,340.0,29987.0,358.0,28113.0,65280.0,2068.0,1206.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign,Audio,Audio
1,192.168.1.3-192.168.1.90-64738-37700-6,192.168.1.3,64738.0,192.168.1.90,37700,6.0,01/01/1970 07:41:46 AM,119106942.0,36.0,57.0,1280.0,1650.0,41.0,0.0,35.555556,13.010862,66.0,0.0,28.947368,33.041328,24.599742,0.780811,1294641.0,2083523.0,5003516.0,96.0,119066998.0,3401914.0,2105502.0,5008354.0,235.0,119095444.0,2126704.0,2320660.0,5003516.0,2134.0,1.0,0.0,0.0,0.0,1164.0,1836.0,0.302249,0.478562,0.0,66.0,31.606383,27.063135,732.413292,0.0,0.0,0.0,57.0,93.0,0.0,0.0,0.0,1.0,31.946237,35.555556,28.947368,0.0,0.0,0.0,182.0,4.0,157.0,1.0,49.0,2.0,63.0,502.0,502.0,31.0,32.0,18023270.0,3416174.0,19996926.0,14078617.0,5001511.0,1737.400069,5003516.0,5000449.0,Benign,Audio,Audio
2,192.168.1.3-192.168.1.90-22-40854-6,192.168.1.3,22.0,192.168.1.90,40854,6.0,01/01/1970 07:41:46 AM,5589.0,1.0,1.0,124.0,0.0,124.0,124.0,124.0,0.0,0.0,0.0,0.0,0.0,22186.437645,357.845768,5589.0,0.0,5589.0,5589.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,32.0,32.0,178.922884,178.922884,0.0,124.0,82.666667,71.591433,5125.333333,0.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,1.0,124.0,124.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,501.0,502.0,0.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign,Audio,Audio
3,192.168.1.70-192.168.1.3-55422-64738-6,192.168.1.70,55422.0,192.168.1.3,64738,6.0,01/01/1970 07:41:47 AM,118166562.0,3932.0,4196.0,403875.0,301478.0,1348.0,0.0,102.714903,164.095546,1348.0,0.0,71.848904,94.529957,5969.1421,68.784264,14540.0,84243.33,3663220.0,-497.0,118166562.0,30060.18,123372.2,3663290.0,0.0,118141369.0,28162.42,116725.2,3663475.0,55.0,0.0,0.0,0.0,0.0,125928.0,134508.0,33.275065,35.509199,0.0,1348.0,86.769959,133.693257,17873.887035,0.0,2.0,0.0,5329.0,8127.0,0.0,0.0,0.0,1.0,86.780635,102.714903,71.848904,0.0,0.0,0.0,100716.0,764.0,7317.0,393.0,40387.0,419.0,30147.0,65280.0,4708.0,2770.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign,Audio,Audio
4,192.168.1.90-192.168.1.3-59658-64738-17,192.168.1.90,59658.0,192.168.1.3,64738,17.0,01/01/1970 07:41:50 AM,119988385.0,25.0,6795.0,225.0,534811.0,9.0,9.0,9.0,0.0,121.0,9.0,78.706549,16.716483,4459.064934,56.838835,17596.18,105625.4,4685144.0,98.0,119988195.0,4999508.0,3117.113,5007045.0,4992305.0,119987788.0,17660.85,105817.8,4685411.0,98.0,0.0,0.0,0.0,0.0,200.0,54360.0,0.208354,56.630481,9.0,121.0,78.440844,17.228724,296.828941,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,271.0,78.452346,9.0,78.706549,0.0,0.0,0.0,76426.0,973.0,5507.0,2.0,25.0,755.0,59423.0,0.0,0.0,24.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign,Audio,Audio


In [12]:
len(data)

8656767

In [13]:
data['Label'].value_counts()

Label
Malicious    8655466
Benign          1301
Name: count, dtype: int64

In [14]:
data['Traffic Type'].value_counts()

Traffic Type
DoS                      7490929
Information Gathering    1038363
Mirai                      91002
Bruteforce                 35172
Video                        870
Text                         209
Audio                        190
Background                    32
Name: count, dtype: int64

In [15]:
data['Traffic Subtype'].value_counts()

Traffic Subtype
DoS RST                  1072504
Information Gathering    1038363
DoS ACK                   936307
DoS PSH                   909507
DoS URG                   906190
DoS CWR                   872523
DoS ECN                   871150
DoS SYN                   856764
DoS FIN                   725600
DoS UDP                   257994
DoS HTTP                   82351
Mirai DDoS DNS             55196
Bruteforce DNS             22179
Mirai DDoS SYN             14210
Mirai DDoS HTTP             8923
Mirai Scan Bruteforce       8731
Bruteforce Telnet           4913
Bruteforce SSH              3967
Mirai DDoS ACK              3779
Bruteforce FTP              3485
Bruteforce HTTP              628
Video HTTP                   376
Video RTP                    349
Text                         209
Audio                        190
Video UDP                    145
Mirai DDoS UDP                71
Mirai DDoS GREIP              49
Mirai DDoS GREETH             43
Background                 

In [16]:
# How many different attacks are there?
len(data['Traffic Subtype'].value_counts())

32

This count also includes several benign traffic subtypes, so select the malicious subtypes only.

In [17]:
data[data['Label'] == 'Malicious']['Traffic Subtype'].value_counts()

Traffic Subtype
DoS RST                  1072504
Information Gathering    1038363
DoS ACK                   936307
DoS PSH                   909507
DoS URG                   906190
DoS CWR                   872523
DoS ECN                   871150
DoS SYN                   856764
DoS FIN                   725600
DoS UDP                   257994
DoS HTTP                   82351
Mirai DDoS DNS             55196
Bruteforce DNS             22179
Mirai DDoS SYN             14210
Mirai DDoS HTTP             8923
Mirai Scan Bruteforce       8731
Bruteforce Telnet           4913
Bruteforce SSH              3967
Mirai DDoS ACK              3779
Bruteforce FTP              3485
Bruteforce HTTP              628
Mirai DDoS UDP                71
Mirai DDoS GREIP              49
Mirai DDoS GREETH             43
DoS MAC                       30
DoS ICMP                       9
Name: count, dtype: int64

In [18]:
len(data[data['Label'] == 'Malicious']['Traffic Subtype'].value_counts())

26

### Duration Estimation From CSV

In [19]:
tstamps = pd.to_datetime(data['Timestamp']).sort_values(ascending=False).reset_index(drop=True)
tstamps


0         2023-08-02 04:17:06
1         2023-08-02 04:17:06
2         2023-08-02 04:17:06
3         2023-08-02 04:17:06
4         2023-08-02 04:17:06
                  ...        
8656762   1970-01-01 04:15:06
8656763   1970-01-01 04:15:06
8656764   1970-01-01 04:15:06
8656765   1970-01-01 04:15:06
8656766   1970-01-01 04:08:55
Name: Timestamp, Length: 8656767, dtype: datetime64[ns]

In [20]:
len(tstamps)

8656767

In [21]:
import datetime

len(tstamps[tstamps > datetime.datetime(year=2022, month=1, day=1)])

91002
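The count above equals the number of `Mirai` flows reported by `Traffic Type` (91,002), which suggests, but does not confirm, that only the Mirai traffic kept its real timestamps. A minimal sketch of how this could be checked follows. It runs on a small synthetic frame rather than on `data`, and the column name `Has Real Timestamp` is a hypothetical helper, not a dataset feature.

```python
import pandas as pd

# Synthetic stand-in for the flow table (values are made up for illustration)
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "1970-01-01 04:15:06", "1970-01-01 07:41:46",
        "2023-08-02 04:17:06", "2023-08-02 04:17:05",
    ]),
    "Traffic Type": ["DoS", "Audio", "Mirai", "Mirai"],
})

# Flag flows whose timestamp is plausible, using the same cutoff as above
cutoff = pd.Timestamp("2022-01-01")
toy["Has Real Timestamp"] = toy["Timestamp"] > cutoff

# Per traffic type: number of flows with a real timestamp vs. all flows
summary = toy.groupby("Traffic Type")["Has Real Timestamp"].agg(real="sum", total="size")
print(summary)
```

On the real table, replacing `toy` with `data` (after parsing `Timestamp` with `pd.to_datetime`) would give the same breakdown per traffic type.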

Evidently, most flows carry an incorrect (removed) timestamp: only 91,002 of the 8,656,767 flows are dated after 2022-01-01. For the purpose of this analysis, we do not need to determine exactly which traffic types or subtypes were captured with correct timestamps. Nevertheless, the capture can no longer be treated as contiguous. What, then, is the actual capture duration?

In [22]:
# Compute the timespan of the capture and its contiguity
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Compute the total timespan of the capture.

    Expects a Series of timestamps sorted in descending order. Gaps longer than
    `gap_max_secs` are left out of the total and mark the capture as non-contiguous.
    Returns a (pd.Timedelta, bool) tuple.
    """
    total_dur = 0.0
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the timestamps to find gaps
    for cur_tstamp in data:
        # total_seconds() covers the whole gap; .seconds would drop the days component
        gap_secs = (last_tstamp - cur_tstamp).total_seconds()

        if gap_secs <= gap_max_secs:
            total_dur += gap_secs
        else:
            contiguous = False

        last_tstamp = cur_tstamp

    return pd.Timedelta(seconds=total_dur), contiguous

In [23]:
measure_real_capture_dur(tstamps)

(Timedelta('0 days 13:26:50'), False)
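The loop above visits every one of the 8.6 million timestamps in Python. The same idea can be expressed with vectorized pandas operations, as in the sketch below. The function name `measure_capture_dur_vectorized` is hypothetical, each gap is measured with `total_seconds()`, and the result has not been checked against the output recorded above.

```python
import pandas as pd


def measure_capture_dur_vectorized(tstamps: pd.Series, gap_max_secs: float = 300):
    """Sum the gaps between consecutive timestamps that do not exceed gap_max_secs.

    Returns (total duration, True if no gap exceeded the threshold).
    """
    ordered = tstamps.sort_values().reset_index(drop=True)
    # The first difference is NaT (no predecessor), so drop it
    gaps = ordered.diff().dt.total_seconds().iloc[1:]
    small = gaps <= gap_max_secs
    total = float(gaps[small].sum())
    return pd.Timedelta(seconds=total), bool(small.all())


# Synthetic timestamps: gaps of 60 s, 120 s, one huge jump, then 30 s
toy_tstamps = pd.Series(pd.to_datetime([
    "1970-01-01 00:00:00", "1970-01-01 00:01:00", "1970-01-01 00:03:00",
    "2023-08-02 00:00:00", "2023-08-02 00:00:30",
]))
result = measure_capture_dur_vectorized(toy_tstamps)
print(result)
```

The function sorts in ascending order internally, so the input does not need to be pre-sorted.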

Since many captures start at 1970-01-01, the scenarios surely overlap in their recorded timestamps. Therefore, treating each PCAP as a separate scenario, which yields a total capture duration of about 31 days, seems more realistic than this result, although the truth will lie somewhere in between.
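One possible way to approach such an in-between estimate from the CSV is to compute the timestamp span of each traffic subtype separately and add the spans up. The sketch below does this on synthetic data. It assumes that each `Traffic Subtype` corresponds to exactly one scenario, which the dataset documentation does not state, so the result would be an approximation at best.

```python
import pandas as pd

# Synthetic stand-in for the flow table (values are made up for illustration)
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "1970-01-01 00:00:00", "1970-01-01 01:00:00",  # DoS RST: 1 h span
        "1970-01-01 00:30:00", "1970-01-01 00:45:00",  # Audio: 15 min span
        "2023-08-02 04:00:00", "2023-08-02 04:10:00",  # Mirai DDoS SYN: 10 min span
    ]),
    "Traffic Subtype": ["DoS RST", "DoS RST", "Audio", "Audio",
                        "Mirai DDoS SYN", "Mirai DDoS SYN"],
})

# Span of each subtype = last timestamp minus first timestamp
grouped = toy.groupby("Traffic Subtype")["Timestamp"]
spans = grouped.max() - grouped.min()

# Overlapping subtypes are counted separately, as if run one after another
total_span = spans.sum()
print(spans)
print(total_span)
```

Unlike the gap-based measure, this approach does not exclude idle periods inside a scenario, so it would tend to overestimate the active capture time.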