# FLNET2023 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
BASE_FOLDER = '/data/ciciot23/flnet2023/'

### Preparation

In [3]:
def normalize_names(folder: str, new_prefix: str) -> None:
    """Normalization of all data files within a directory by prefixing a certain string to their filenames.
    Performed for easier handling and ability to load by-event (e.g., certain attacks)"""
    # Find out files in the given directory
    files = os.listdir(folder)

    # Go through determined files and rename them accordingly
    for fname in files:
        if os.path.isfile(os.path.join(folder, fname)):
            # Split extension and the file parts by '-'
            fpath, fext = os.path.splitext(fname)
            fpath_parts = fpath.split('-')

            # Search for numeric parts and zero-pad them to two digits
            for idx in range(len(fpath_parts)):
                if fpath_parts[idx].isnumeric():
                    fpath_parts[idx] = f'{int(fpath_parts[idx]):>02}'

            # Rename 1st part with the desired prefix
            if fpath_parts[0] == 'Dataset':
                fpath_parts[0] = new_prefix

            # Create a new file name by merging the file path parts back together
            new_fname = '-'.join(fpath_parts) + fext

            fullpath_old = os.path.join(folder, fname)
            fullpath_new = os.path.join(folder, new_fname)

            os.rename(fullpath_old, fullpath_new)


In [4]:
normalize_names(BASE_FOLDER + '/DoS/CSV', 'dos')
normalize_names(BASE_FOLDER + '/DoS/PCAP', 'dos')

In [5]:
normalize_names(BASE_FOLDER + '/Infiltration/CSV', 'infilt')
normalize_names(BASE_FOLDER + '/Infiltration/PCAP', 'infilt')

In [6]:
normalize_names(BASE_FOLDER + '/Normal/CSV', 'normal')
normalize_names(BASE_FOLDER + '/Normal/PCAP', 'normal')
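
Because `normalize_names` renames files in place, it can help to preview the result first. The sketch below factors the same name transformation into a pure function; `preview_name` is a hypothetical helper, and the sample file names are assumed for illustration rather than taken from the dataset listing.

```python
import os


def preview_name(fname: str, new_prefix: str) -> str:
    """Return the name normalize_names() would produce, without renaming anything."""
    stem, ext = os.path.splitext(fname)
    parts = stem.split('-')

    # Zero-pad purely numeric parts to two digits
    parts = [f'{int(p):02d}' if p.isnumeric() else p for p in parts]

    # Swap the generic 'Dataset' prefix for the event-specific one
    if parts[0] == 'Dataset':
        parts[0] = new_prefix

    return '-'.join(parts) + ext


# Hypothetical input names, for illustration only
for name in ('Dataset-1.csv', 'Dataset-7-hulk.pcap', 'Dataset-12.csv'):
    print(name, '->', preview_name(name, 'dos'))
```

Names that do not start with `Dataset` keep their prefix, so running the normalization twice on the same folder should leave already-normalized files unchanged.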

## CSV Files Analysis

In [9]:
FOLDER_CSV_TRAIN = '/data/ciciot23/flnet2023/train/csv/'

In [10]:
# Are the data properly separated by class? Load 2-3 random files to check the label distribution
data = pd.read_csv(FOLDER_CSV_TRAIN + 'dos-01.csv')

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10590 entries, 0 to 10589
Data columns (total 83 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   src_ip             10590 non-null  object 
 1   dst_ip             10590 non-null  object 
 2   src_port           10590 non-null  int64  
 3   dst_port           10590 non-null  int64  
 4   protocol           10590 non-null  int64  
 5   timestamp          10590 non-null  object 
 6   flow_duration      10590 non-null  int64  
 7   flow_byts_s        10590 non-null  float64
 8   flow_pkts_s        10590 non-null  float64
 9   fwd_pkts_s         10590 non-null  float64
 10  bwd_pkts_s         10590 non-null  float64
 11  tot_fwd_pkts       10590 non-null  int64  
 12  tot_bwd_pkts       10590 non-null  int64  
 13  totlen_fwd_pkts    10590 non-null  int64  
 14  totlen_bwd_pkts    10590 non-null  int64  
 15  fwd_pkt_len_max    10590 non-null  int64  
 16  fwd_pkt_len_min    105

In [12]:
data['label'].value_counts()

label
DoS-slowhttp    10590
Name: count, dtype: int64

In [16]:
data = pd.read_csv(FOLDER_CSV_TRAIN + 'infilt-04.csv')
data['label'].value_counts()

label
Infiltration-mitm    2216
Name: count, dtype: int64

In [17]:
data = pd.read_csv(FOLDER_CSV_TRAIN + 'normal-10.csv')
data['label'].value_counts()

label
Normal    446469
Name: count, dtype: int64

Ok, this seems fine. Load all at once.
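
The spot checks cover only three files. A minimal sketch of an exhaustive check, assuming every CSV has a `label` column as shown in the outputs above, could stream just that column with the standard library and report files holding more than one label. The helper names `labels_per_file` and `mixed_label_files` are hypothetical, and the demo rows are made up.

```python
import csv
import os
import tempfile
from collections import Counter


def labels_per_file(folder: str) -> dict[str, Counter]:
    """Map each CSV file in folder to a Counter of its 'label' values."""
    result = {}
    for fname in sorted(os.listdir(folder)):
        if not fname.endswith('.csv'):
            continue
        with open(os.path.join(folder, fname), newline='') as fh:
            result[fname] = Counter(row['label'] for row in csv.DictReader(fh))
    return result


def mixed_label_files(per_file: dict[str, Counter]) -> list[str]:
    """Return the files that contain more than one distinct label."""
    return [fname for fname, counts in per_file.items() if len(counts) > 1]


# Tiny self-contained demo on throwaway files (made-up rows, not dataset content)
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'single.csv': 'flow_duration,label\n10,Normal\n20,Normal\n',
        'mixed.csv': 'flow_duration,label\n10,Normal\n20,DoS-hulk\n',
    }
    for fname, content in samples.items():
        with open(os.path.join(tmp, fname), 'w', newline='') as fh:
            fh.write(content)
    demo = labels_per_file(tmp)

demo_mixed = mixed_label_files(demo)
print(demo_mixed)
```

On the real data, `mixed_label_files(labels_per_file(FOLDER_CSV_TRAIN))` returning an empty list would indicate that every file holds a single label.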

In [20]:
# Read every CSV in the training folder and concatenate once (avoids repeated copying)
data_train = pd.concat(
    [pd.read_csv(os.path.join(FOLDER_CSV_TRAIN, csv_file)) for csv_file in os.listdir(FOLDER_CSV_TRAIN)]
)

In [21]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5416155 entries, 0 to 4776
Data columns (total 83 columns):
 #   Column             Dtype  
---  ------             -----  
 0   src_ip             object 
 1   dst_ip             object 
 2   src_port           int64  
 3   dst_port           int64  
 4   protocol           int64  
 5   timestamp          object 
 6   flow_duration      int64  
 7   flow_byts_s        float64
 8   flow_pkts_s        float64
 9   fwd_pkts_s         float64
 10  bwd_pkts_s         float64
 11  tot_fwd_pkts       int64  
 12  tot_bwd_pkts       int64  
 13  totlen_fwd_pkts    int64  
 14  totlen_bwd_pkts    int64  
 15  fwd_pkt_len_max    int64  
 16  fwd_pkt_len_min    int64  
 17  fwd_pkt_len_mean   float64
 18  fwd_pkt_len_std    float64
 19  bwd_pkt_len_max    int64  
 20  bwd_pkt_len_min    int64  
 21  bwd_pkt_len_mean   float64
 22  bwd_pkt_len_std    float64
 23  pkt_len_max        int64  
 24  pkt_len_min        int64  
 25  pkt_len_mean       float64

In [23]:
data_train.head()

| | src_ip | dst_ip | src_port | dst_port | protocol | timestamp | flow_duration | flow_byts_s | flow_pkts_s | fwd_pkts_s | ... | fwd_blk_rate_avg | bwd_blk_rate_avg | fwd_seg_size_avg | bwd_seg_size_avg | cwe_flag_count | subflow_fwd_pkts | subflow_bwd_pkts | subflow_fwd_byts | subflow_bwd_byts | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.0.51.2 | 10.0.48.1 | 57136 | 5050 | 2048 | 2023-05-13 15:48:13 | 3807 | 26039400.0 | 35198.318886 | 25216.706068 | ... | 28421690.0 | 0.0 | 1001.6875 | 78.157895 | 0 | 96 | 38 | 96162 | 2970 | DDoS-bot |
| 1 | 10.0.51.2 | 10.0.48.1 | 57164 | 5050 | 2048 | 2023-05-13 15:48:13 | 5312 | 16632910.0 | 24472.891566 | 17319.277108 | ... | 21030030.0 | 0.0 | 928.086957 | 78.157895 | 0 | 92 | 38 | 85384 | 2970 | DDoS-bot |
| 2 | 10.0.51.2 | 10.0.48.1 | 57184 | 5050 | 2048 | 2023-05-13 15:48:13 | 6463 | 14243540.0 | 20423.951725 | 14544.329259 | ... | 14679090.0 | 0.0 | 947.723404 | 78.157895 | 0 | 94 | 38 | 89086 | 2970 | DDoS-bot |
| 3 | 10.0.52.2 | 10.0.48.1 | 59040 | 5050 | 2048 | 2023-05-13 15:48:14 | 2165 | 42921020.0 | 60969.976905 | 43418.013857 | ... | 210503800.0 | 0.0 | 956.957447 | 78.157895 | 0 | 94 | 38 | 89954 | 2970 | DDoS-bot |
| 4 | 10.0.52.2 | 10.0.48.1 | 59048 | 5050 | 2048 | 2023-05-13 15:48:14 | 2256 | 43178190.0 | 59397.163121 | 42553.191489 | ... | 125959900.0 | 0.0 | 983.75 | 78.157895 | 0 | 96 | 38 | 94440 | 2970 | DDoS-bot |


In [22]:
# Label distribution
data_train['label'].value_counts()

label
Normal                   1922529
DoS-hulk                 1644381
DDoS-tcp                  930798
DDoS-stomp                389540
DDoS-dyn                  322920
DDoS-bot                   87808
DoS-slowhttp               86226
Infiltration-mitm          28113
Web-xss                     3069
Web-sql-injection            441
Web-command-injection        330
Name: count, dtype: int64

In [32]:
# Measure the timespan of the data capture
tstamps = pd.to_datetime(data_train['timestamp'])

tstamps_sorted = tstamps.sort_values()

(tstamps_sorted.iloc[-1] - tstamps_sorted.iloc[0])

Timedelta('20 days 05:40:42')

The capture spans 20 d 5 h 40 min; let's see whether the PCAPs tell us more.

## Load the Test Set

In [33]:
TEST_SET_BASE_DIR = '/data/ciciot23/flnet2023/test/csv'

In [35]:
# Read every CSV in the test folder and concatenate once (avoids repeated copying)
test_data_all = pd.concat(
    [pd.read_csv(os.path.join(TEST_SET_BASE_DIR, csv_file)) for csv_file in os.listdir(TEST_SET_BASE_DIR)]
)

test_data_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 746617 entries, 0 to 302080
Data columns (total 83 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   src_ip             746617 non-null  object 
 1   dst_ip             746617 non-null  object 
 2   src_port           746617 non-null  int64  
 3   dst_port           746617 non-null  int64  
 4   protocol           746617 non-null  int64  
 5   timestamp          746617 non-null  object 
 6   flow_duration      746617 non-null  int64  
 7   flow_byts_s        746617 non-null  float64
 8   flow_pkts_s        746617 non-null  float64
 9   fwd_pkts_s         746617 non-null  float64
 10  bwd_pkts_s         746617 non-null  float64
 11  tot_fwd_pkts       746617 non-null  int64  
 12  tot_bwd_pkts       746617 non-null  int64  
 13  totlen_fwd_pkts    746617 non-null  int64  
 14  totlen_bwd_pkts    746617 non-null  int64  
 15  fwd_pkt_len_max    746617 non-null  int64  
 16  fwd_pkt

In [37]:
test_data_all['label'].value_counts()

label
DDoS-tcp                 302081
Normal                   189836
DoS-hulk                 108345
DDoS-stomp                92286
DDoS-bot                  22667
DDoS-dyn                  15341
DoS-slowhttp              12092
Infiltration-mitm          1870
Web-xss                    1533
Web-command-injection       345
Web-sql-injection           221
Name: count, dtype: int64

Although timestamps are provided within the CSV data, we opted to analyze capture duration and continuity via the PCAPs (`capinfos`).

## Analyze PCAP files - Train

In [38]:
PCAP_TRAIN_DIR = '/data/ciciot23/flnet2023/train/pcap'
PCAP_TEST_DIR = '/data/ciciot23/flnet2023/test/pcap'

In [43]:
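def analyze_dir(directory: str) -> None:
    """Print capinfos packet count and first/last packet times for every file in a directory."""
    for fname in sorted(os.listdir(directory)):
        fullpath = os.path.join(directory, fname)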

        !capinfos -a -e -c -M $fullpath

        print()
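
Reading the `capinfos` reports by eye works for a handful of files. If the text output is captured instead, it can be parsed into structured records. The sketch below assumes the four-field layout that the flags `-a -e -c -M` produce in this notebook; `parse_time` and `parse_capinfos` are hypothetical helper names, and the sample text is copied from the notebook's own output. Nanosecond timestamps are truncated to microseconds because `datetime` does not store finer resolution.

```python
import re
from datetime import datetime

FIELD_RE = re.compile(
    r'^(File name|Number of packets|First packet time|Last packet time):\s+(.*)$'
)


def parse_time(text: str) -> datetime:
    """Parse a capinfos timestamp, truncating the fraction to microseconds."""
    base, _, frac = text.strip().partition('.')
    frac = (frac + '000000')[:6]
    return datetime.strptime(f'{base}.{frac}', '%Y-%m-%d %H:%M:%S.%f')


def parse_capinfos(output: str) -> list[dict]:
    """Turn a multi-file capinfos text report into a list of dictionaries."""
    records, current = [], {}
    for line in output.splitlines():
        match = FIELD_RE.match(line.strip())
        if not match:
            continue
        key, value = match.groups()
        if key == 'File name':
            if current:
                records.append(current)
            current = {'file': value}
        elif key == 'Number of packets':
            current['packets'] = int(value)
        elif key == 'First packet time':
            current['first'] = parse_time(value)
        else:
            current['last'] = parse_time(value)
    if current:
        records.append(current)
    return records


sample = """\
File name:           /data/ciciot23/flnet2023/train/pcap/ddos-01-bot.pcap
Number of packets:   23978351
First packet time:   2023-05-13 23:48:11.278016236
Last packet time:    2023-05-13 23:53:17.204687976

File name:           /data/ciciot23/flnet2023/train/pcaps_analysis/infilt-02.pcap
Number of packets:   343009
First packet time:   2023-05-04 21:46:12.841796
Last packet time:    2023-05-04 22:39:17.412311
"""

records = parse_capinfos(sample)
for rec in records:
    print(rec['file'], rec['packets'], rec['last'] - rec['first'])
```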

In [44]:
analyze_dir(PCAP_TRAIN_DIR)

File name:           /data/ciciot23/flnet2023/train/pcap/ddos-01-bot.pcap
Number of packets:   23978351
First packet time:   2023-05-13 23:48:11.278016236
Last packet time:    2023-05-13 23:53:17.204687976

File name:           /data/ciciot23/flnet2023/train/pcap/ddos-05-stomp.pcap
Number of packets:   44138355
First packet time:   2023-05-14 00:32:23.174696261
Last packet time:    2023-05-14 00:37:31.590235024

File name:           /data/ciciot23/flnet2023/train/pcap/ddos-09-dyn.pcap
Number of packets:   35595302
First packet time:   2023-05-14 01:35:17.562842961
Last packet time:    2023-05-14 01:45:22.453138179

File name:           /data/ciciot23/flnet2023/train/pcap/ddos-10-tcp.pcap
Number of packets:   107193786
First packet time:   2023-05-14 02:00:18.383568109
Last packet time:    2023-05-14 02:05:24.824837909

File name:           /data/ciciot23/flnet2023/train/pcap/dos-01.pcap
Number of packets:   232940
First packet time:   2023-04-27 18:48:44.589418
Last packet time:    202

In [47]:
# Recheck on the cleared folder
TRAIN_PCAPS_CLEARED = '/data/ciciot23/flnet2023/train/pcaps_analysis'

analyze_dir(TRAIN_PCAPS_CLEARED)

File name:           /data/ciciot23/flnet2023/train/pcaps_analysis/dos-01.pcap
Number of packets:   232940
First packet time:   2023-04-27 18:48:44.589418
Last packet time:    2023-04-27 19:26:02.879784

File name:           /data/ciciot23/flnet2023/train/pcaps_analysis/dos-07-hulk.pcap
Number of packets:   1704896
First packet time:   2023-05-06 16:01:02.369564
Last packet time:    2023-05-06 16:02:36.255660

File name:           /data/ciciot23/flnet2023/train/pcaps_analysis/infilt-02.pcap
Number of packets:   343009
First packet time:   2023-05-04 21:46:12.841796
Last packet time:    2023-05-04 22:39:17.412311

File name:           /data/ciciot23/flnet2023/train/pcaps_analysis/infilt-03.pcap
Number of packets:   426611
First packet time:   2023-05-04 21:15:32.662106
Last packet time:    2023-05-04 21:41:00.639092

File name:           /data/ciciot23/flnet2023/train/pcaps_analysis/normal-01.pcap
Number of packets:   602804
First packet time:   2023-05-11 23:52:49.026487
Last packet ti

In [61]:
# Total number of seconds covered by non-overlapping captures, computed by an external script (countdur.sh, available on GitHub)
import datetime

secs = 524247.257416225

display(datetime.timedelta(seconds=secs))

datetime.timedelta(days=6, seconds=5847, microseconds=257416)

In [62]:
5847 / 3600

1.6241666666666668
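
The external script itself is not reproduced in this notebook. As an illustration of what counting "non-overlapping" capture time can mean, the sketch below merges overlapping `(start, end)` intervals and sums the merged lengths. This is an assumed interpretation, not the logic of `countdur.sh`; `covered_duration` is a hypothetical name and the demo intervals are made up.

```python
from datetime import datetime, timedelta


def covered_duration(intervals) -> timedelta:
    """Total time covered by (start, end) intervals, counting overlaps only once."""
    total = timedelta(0)
    cur_start = cur_end = None

    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged block if needed
            cur_end = max(cur_end, end)

    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Made-up capture windows: two overlapping ones and one separate
t0 = datetime(2023, 5, 4, 21, 0, 0)
demo_intervals = [
    (t0, t0 + timedelta(minutes=30)),
    (t0 + timedelta(minutes=20), t0 + timedelta(minutes=50)),
    (t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=10)),
]

print(covered_duration(demo_intervals))
```

The first two windows merge into one 50-minute block and the third adds 10 minutes, so the covered time is one hour rather than the 70 minutes a naive sum would give.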

## Peek at the Test PCAP Data

In [63]:
TEST_PCAP_DIR = '/data/ciciot23/flnet2023/test/pcap'

In [64]:
analyze_dir(TEST_PCAP_DIR)

File name:           /data/ciciot23/flnet2023/test/pcap/Dataset-TEST-BOT.pcap
Number of packets:   3698823
First packet time:   2023-05-14 02:36:56.439753326
Last packet time:    2023-05-14 02:38:00.458457467

File name:           /data/ciciot23/flnet2023/test/pcap/command-injection.pcap
Number of packets:   10083
First packet time:   2023-05-09 21:23:40.450106
Last packet time:    2023-05-09 21:25:50.576518

File name:           /data/ciciot23/flnet2023/test/pcap/dyn.pcap
Number of packets:   1983511
First packet time:   2023-05-14 02:48:36.627204847
Last packet time:    2023-05-14 02:51:52.669290990

File name:           /data/ciciot23/flnet2023/test/pcap/hulk.pcap
Number of packets:   10317291
First packet time:   2023-05-14 02:25:58.073095436
Last packet time:    2023-05-14 02:31:02.328689006

File name:           /data/ciciot23/flnet2023/test/pcap/mitm.pcap
Number of packets:   137076
First packet time:   2023-05-06 17:06:11.543402
Last packet time:    2023-05-06 17:26:53.885374



In [65]:
# Again, try it on the cleaned directory with selected PCAPs
analyze_dir('/data/ciciot23/flnet2023/test/pcap_cleaned')

File name:           /data/ciciot23/flnet2023/test/pcap_cleaned/mitm.pcap
Number of packets:   137076
First packet time:   2023-05-06 17:06:11.543402
Last packet time:    2023-05-06 17:26:53.885374

File name:           /data/ciciot23/flnet2023/test/pcap_cleaned/normal.pcap
Number of packets:   793723
First packet time:   2023-05-06 17:53:31.921323
Last packet time:    2023-05-18 00:50:52.879011028

File name:           /data/ciciot23/flnet2023/test/pcap_cleaned/slowHttp.pcap
Number of packets:   236950
First packet time:   2023-05-06 17:00:27.459738
Last packet time:    2023-05-06 17:04:39.520730

File name:           /data/ciciot23/flnet2023/test/pcap_cleaned/sql.pcap
Number of packets:   10552
First packet time:   2023-05-06 17:33:13.972102
Last packet time:    2023-05-06 17:34:01.989668

File name:           /data/ciciot23/flnet2023/test/pcap_cleaned/xss.pcap
Number of packets:   37036
First packet time:   2023-05-06 17:40:14.133221
Last packet time:    2023-05-06 17:40:24.136293



In [66]:
# Result of running the external duration-counting script
dur_sec = 976993.381290028

datetime.timedelta(seconds=dur_sec)

datetime.timedelta(days=11, seconds=26593, microseconds=381290)

In [67]:
26593 / 3600

7.3869444444444445
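
The manual division of leftover seconds by 3600 can be folded into a small helper that splits a duration into days, hours, minutes, and seconds. A minimal sketch, where `split_duration` is a hypothetical name and the two inputs are the second counts from cells In [61] and In [66]:

```python
def split_duration(seconds: float) -> tuple:
    """Split a duration in seconds into (days, hours, minutes, seconds)."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return int(days), int(hours), int(minutes), secs


for cell, total_secs in (('In [61]', 524247.257416225), ('In [66]', 976993.381290028)):
    d, h, m, s = split_duration(total_secs)
    print(f'{cell}: {d} d {h} h {m} min {s:.1f} s')
```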