# CIC-DDoS2019 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

# Preparation

Here, we merge all per-attack files into a single CSV per day. If memory is a constraint, it may be preferable to load the files one by one instead of merging them.
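For readers who cannot hold a merged day in memory, the following is a minimal sketch of the one-by-one alternative. It is not part of the original analysis: the helper name `count_labels_per_file` and the default chunk size are assumptions made for illustration, and it only aggregates label counts rather than reproducing the full analysis below.

```python
import glob
import os

import pandas as pd


def count_labels_per_file(folder: str, label_col: str = " Label",
                          chunksize: int = 1_000_000) -> pd.Series:
    """Count labels across all CSVs in `folder` without loading them at once.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    totals = pd.Series(dtype="int64")
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # Read only the label column, chunk by chunk, to bound memory usage
        for chunk in pd.read_csv(path, usecols=[label_col], chunksize=chunksize):
            counts = chunk[label_col].value_counts()
            # Labels missing on either side are treated as a count of 0
            totals = totals.add(counts, fill_value=0)
    return totals.astype("int64").sort_values(ascending=False)
```

Calling `count_labels_per_file(DAY1_DIR)` would then give a per-label flow count for the day, assuming every file in the directory uses the same ` Label` column name (note the leading space).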

In [2]:
# After extracting the original archives, we have one directory per day, each holding the per-attack CSV files
ROOT_DIR = '/data/disk2/cic_ddos2019/'
DAY1_DIR = os.path.join(ROOT_DIR, '01-12')
DAY2_DIR = os.path.join(ROOT_DIR, '03-11')

# Our goal is to merge all attack files of a day into a single CSV
PATH_DAY1 = os.path.join(ROOT_DIR, '01-12.csv')
PATH_DAY2 = os.path.join(ROOT_DIR, '03-11.csv')

In [3]:
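def create_single_file(folder: str, final_filepath: str) -> None:
    # Note: os.listdir() returns entries in arbitrary order,
    # so the row order of the merged file is not guaranteed
    folder_files = os.listdir(folder)

    # `idx` instead of `id` to avoid shadowing the built-in id()
    for idx, file in enumerate(folder_files):
        file_fullpath = os.path.join(folder, file)

        # Create the file header (copy the 1st line of the first file)
        if idx == 0:
            !head -n 1 $file_fullpath > $final_filepath

        # Append all but the first line to the merged file
        !tail -n +2 $file_fullpath >> $final_filepath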

In [4]:
# Create one merged file per day
create_single_file(DAY1_DIR, PATH_DAY1)
create_single_file(DAY2_DIR, PATH_DAY2)

## Day #1 (Train) Data

In [5]:
# Read the Day 1 data
data1 = pd.read_csv(PATH_DAY1)


In [6]:
# What does the data look like?
data1.head(10)

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,186059,172.16.0.5-192.168.50.1-58445-4463-17,172.16.0.5,58445,192.168.50.1,4463,17,2018-12-01 13:04:45.928673,1,2,0,766.0,0.0,383.0,383.0,383.0,0.0,0.0,0.0,0.0,0.0,766000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,383.0,383.0,383.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,574.5,383.0,0.0,-2,0,0,0,0,0,0,2,766,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
1,135692,172.16.0.5-192.168.50.1-36908-9914-17,172.16.0.5,36908,192.168.50.1,9914,17,2018-12-01 13:04:45.928913,1,2,0,778.0,0.0,389.0,389.0,389.0,0.0,0.0,0.0,0.0,0.0,778000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,389.0,389.0,389.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,583.5,389.0,0.0,-2,0,0,0,0,0,0,2,778,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
2,33822,172.16.0.5-192.168.50.1-41727-32361-17,172.16.0.5,41727,192.168.50.1,32361,17,2018-12-01 13:04:45.928915,2,2,0,750.0,0.0,375.0,375.0,375.0,0.0,0.0,0.0,0.0,0.0,375000000.0,1000000.0,2.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,1000000.0,0.0,375.0,375.0,375.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,562.5,375.0,0.0,-2,0,0,0,0,0,0,2,750,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
3,24498,172.16.0.5-192.168.50.1-55447-5691-17,172.16.0.5,55447,192.168.50.1,5691,17,2018-12-01 13:04:45.929024,2,2,0,738.0,0.0,369.0,369.0,369.0,0.0,0.0,0.0,0.0,0.0,369000000.0,1000000.0,2.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,1000000.0,0.0,369.0,369.0,369.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,553.5,369.0,0.0,-2,0,0,0,0,0,0,2,738,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
4,117372,172.16.0.5-192.168.50.1-58794-56335-17,172.16.0.5,58794,192.168.50.1,56335,17,2018-12-01 13:04:45.929096,1,2,0,750.0,0.0,375.0,375.0,375.0,0.0,0.0,0.0,0.0,0.0,750000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,375.0,375.0,375.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,562.5,375.0,0.0,-2,0,0,0,0,0,0,2,750,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
5,250173,172.16.0.5-192.168.50.1-48002-57615-17,172.16.0.5,48002,192.168.50.1,57615,17,2018-12-01 13:04:45.929170,106422,4,0,1438.0,0.0,389.0,330.0,359.5,34.063666,0.0,0.0,0.0,0.0,13512.24,37.58621,35474.0,61441.038297,106420.0,1.0,106422.0,35474.0,61441.038297,106420.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-4,0,37.58621,0.0,330.0,389.0,353.6,32.315631,1044.3,0,0,0,0,0,0,0,0,0.0,442.0,359.5,0.0,-4,0,0,0,0,0,0,4,1438,0,0,-1,-1,3,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
6,170534,172.16.0.5-192.168.50.1-51718-34159-17,172.16.0.5,51718,192.168.50.1,34159,17,2018-12-01 13:04:45.929172,1,2,0,778.0,0.0,389.0,389.0,389.0,0.0,0.0,0.0,0.0,0.0,778000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,389.0,389.0,389.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,583.5,389.0,0.0,-2,0,0,0,0,0,0,2,778,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
7,223983,172.16.0.5-192.168.50.1-55448-12269-17,172.16.0.5,55448,192.168.50.1,12269,17,2018-12-01 13:04:45.929288,1,2,0,738.0,0.0,369.0,369.0,369.0,0.0,0.0,0.0,0.0,0.0,738000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,369.0,369.0,369.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,553.5,369.0,0.0,-2,0,0,0,0,0,0,2,738,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
8,333455,172.16.0.5-192.168.50.1-58793-36639-17,172.16.0.5,58793,192.168.50.1,36639,17,2018-12-01 13:04:45.929379,1,2,0,750.0,0.0,375.0,375.0,375.0,0.0,0.0,0.0,0.0,0.0,750000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,375.0,375.0,375.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,562.5,375.0,0.0,-2,0,0,0,0,0,0,2,750,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
9,23583,172.16.0.5-192.168.50.1-35276-36368-17,172.16.0.5,35276,192.168.50.1,36368,17,2018-12-01 13:04:45.929381,989278,20,0,7242.0,0.0,401.0,321.0,362.1,30.127623,0.0,0.0,0.0,0.0,7320.49,20.21676,52067.263158,56392.7458,112381.0,1.0,989278.0,52067.263158,56392.7458,112381.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-20,0,20.21676,0.0,321.0,401.0,360.142857,30.703885,942.728571,0,0,0,0,0,0,0,0,0.0,378.15,362.1,0.0,-20,0,0,0,0,0,0,20,7242,0,0,-1,-1,19,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag


In [5]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50063112 entries, 0 to 50063111
Data columns (total 88 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   Unnamed: 0                    int64  
 1   Flow ID                       object 
 2    Source IP                    object 
 3    Source Port                  int64  
 4    Destination IP               object 
 5    Destination Port             int64  
 6    Protocol                     int64  
 7    Timestamp                    object 
 8    Flow Duration                int64  
 9    Total Fwd Packets            int64  
 10   Total Backward Packets       int64  
 11  Total Length of Fwd Packets   float64
 12   Total Length of Bwd Packets  float64
 13   Fwd Packet Length Max        float64
 14   Fwd Packet Length Min        float64
 15   Fwd Packet Length Mean       float64
 16   Fwd Packet Length Std        float64
 17  Bwd Packet Length Max         float64
 18   Bwd Packet Length M

In [6]:
# Does any column contain NaN values?
data1.columns[data1.isna().any(axis=0)]

Index(['Flow Bytes/s'], dtype='object')

In [7]:
# Does any column contain only a single distinct value?
data1.columns[data1.nunique() == 1]

Index([' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags', 'FIN Flag Count',
       ' PSH Flag Count', ' ECE Flag Count', 'Fwd Avg Bytes/Bulk',
       ' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk',
       ' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate'],
      dtype='object')

Twelve columns contain a single constant value, probably all 0s, so they carry no information. Not a great sign for the dataset.

In [8]:
# Just for verification
data1[' Bwd PSH Flags'].value_counts()

0    50063112
Name:  Bwd PSH Flags, dtype: int64

In [9]:
data1[' Label'].value_counts()

TFTP             20082580
DrDoS_SNMP        5159870
DrDoS_DNS         5071011
DrDoS_MSSQL       4522492
DrDoS_NetBIOS     4093279
DrDoS_UDP         3134645
DrDoS_SSDP        2610611
DrDoS_LDAP        2179930
Syn               1582289
DrDoS_NTP         1202642
UDP-lag            366461
BENIGN              56863
WebDDoS               439
Name:  Label, dtype: int64

In [10]:
# Free the memory before looking at data2 (test day)
del data1

## Day #2 (Test) Data

In [11]:
# Only work with a subset of test data
data2 = pd.read_csv(os.path.join(DAY2_DIR, 'UDP.csv'))


In [12]:
data2.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3782206 entries, 0 to 3782205
Data columns (total 88 columns):
 #   Column                        Non-Null Count    Dtype  
---  ------                        --------------    -----  
 0   Unnamed: 0                    3782206 non-null  int64  
 1   Flow ID                       3782206 non-null  object 
 2    Source IP                    3782206 non-null  object 
 3    Source Port                  3782206 non-null  int64  
 4    Destination IP               3782206 non-null  object 
 5    Destination Port             3782206 non-null  int64  
 6    Protocol                     3782206 non-null  int64  
 7    Timestamp                    3782206 non-null  object 
 8    Flow Duration                3782206 non-null  int64  
 9    Total Fwd Packets            3782206 non-null  int64  
 10   Total Backward Packets       3782206 non-null  int64  
 11  Total Length of Fwd Packets   3782206 non-null  float64
 12   Total Length of Bwd Packets

In [13]:
data2.head(5)

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,15798,172.16.0.5-192.168.50.4-9401-15931-17,172.16.0.5,9401,192.168.50.4,15931,17,2018-11-03 10:42:57.176671,1,2,0,2560.0,0.0,1280.0,1280.0,1280.0,0.0,0.0,0.0,0.0,0.0,2560000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,1280.0,1280.0,1280.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,1920.0,1280.0,0.0,-2,0,0,0,0,0,0,2,2560,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
1,110891,172.16.0.5-192.168.50.4-9402-29997-17,172.16.0.5,9402,192.168.50.4,29997,17,2018-11-03 10:42:57.176673,0,2,0,816.0,0.0,408.0,408.0,408.0,0.0,0.0,0.0,0.0,0.0,inf,inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,0.0,0.0,408.0,408.0,408.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,612.0,408.0,0.0,-2,0,0,0,0,0,0,2,816,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
2,66956,172.16.0.5-192.168.50.4-9403-29887-17,172.16.0.5,9403,192.168.50.4,29887,17,2018-11-03 10:42:57.176727,1,2,0,810.0,0.0,405.0,405.0,405.0,0.0,0.0,0.0,0.0,0.0,810000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,405.0,405.0,405.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,607.5,405.0,0.0,-2,0,0,0,0,0,0,2,810,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
3,66144,172.16.0.5-192.168.50.4-9404-7393-17,172.16.0.5,9404,192.168.50.4,7393,17,2018-11-03 10:42:57.176729,1,2,0,852.0,0.0,426.0,426.0,426.0,0.0,0.0,0.0,0.0,0.0,852000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,426.0,426.0,426.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,639.0,426.0,0.0,-2,0,0,0,0,0,0,2,852,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
4,72903,172.16.0.5-192.168.50.4-9405-57957-17,172.16.0.5,9405,192.168.50.4,57957,17,2018-11-03 10:42:57.177121,1,2,0,1240.0,0.0,620.0,620.0,620.0,0.0,0.0,0.0,0.0,0.0,1240000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,620.0,620.0,620.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,930.0,620.0,0.0,-2,0,0,0,0,0,0,2,1240,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL


In [14]:
data2.describe()


Unnamed: 0.1,Unnamed: 0,Source Port,Destination Port,Protocol,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Inbound
count,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782202.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0,3782206.0
mean,63143.32,46793.48,33300.75,16.9923,90540.14,3.749488,0.007763723,1362.059,5.805188,388.1425,359.5337,372.4167,15.07711,0.2106897,0.02022708,0.07380842,0.05931257,inf,inf,18958.56,30069.7,55238.58,12.65483,90332.81,19088.19,30224.46,55118.13,12.55458,5113.479,277.7776,611.0455,1921.615,0.001793133,0.0001853416,0.0,0.0,0.0,-56030450.0,-561.7487,751446.1,12.19106,359.5324,388.2642,370.2776,14.4435,467.1685,0.0,3.965939e-06,0.0001853416,0.0,0.0001853416,0.0004716824,0.0001821688,0.0,0.0006882227,505.1987,372.4167,0.07380842,-56030450.0,0.0,0.0,0.0,0.0,0.0,0.0,3.749488,1362.059,0.007763723,5.805188,4.233566,0.7263605,2.746263,-23573530.0,75.49161,28.5183,120.767,58.50962,2400.031,125.0781,2517.102,2292.068,0.9990566
std,36575.31,8424.7,18652.94,0.2965105,766143.6,3.573937,1.44243,1201.26,2216.302,30.95188,43.03871,34.48869,16.1596,20.29896,1.436148,6.607057,5.997098,,,55884.53,106837.8,327360.8,4993.744,763303.4,62976.73,119865.7,325293.8,5034.657,681044.2,33247.11,75642.03,268933.8,0.2167218,0.01361276,0.0,0.0,0.0,434923100.0,1092888.0,941116.8,3446.501,43.04087,34.60111,35.59022,16.07961,4145.499,0.0,0.001991463,0.01361276,0.0,0.01361276,0.02171313,0.01349577,0.0,0.03625421,92.53673,34.48869,6.607057,434923100.0,0.0,0.0,0.0,0.0,0.0,0.0,3.573937,1201.26,1.44243,2216.302,371.5303,242.0249,3.445795,156513100.0,15980.06,8481.062,23635.65,14645.51,303930.7,60378.65,319717.2,296480.8,0.03069976
min,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03456459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-89268390000.0,-2125438000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-89268390000.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-1.0,-1.0,0.0,-1062719000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,31518.0,39854.0,17211.0,17.0,1.0,2.0,0.0,766.0,0.0,375.0,330.0,349.5,0.0,0.0,0.0,0.0,0.0,12985.08,36.78161,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,0.0,36.72285,0.0,330.0,375.0,345.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,432.0,349.5,0.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,766.0,0.0,0.0,-1.0,-1.0,1.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,63036.0,46908.0,33288.0,17.0,51.0,2.0,0.0,1146.0,0.0,389.0,375.0,375.0,0.0,0.0,0.0,0.0,0.0,15019610.0,40000.0,50.0,0.0,51.0,1.0,51.0,50.0,0.0,50.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,0.0,118.2208,0.0,375.0,389.0,375.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,515.5,375.0,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1146.0,0.0,0.0,-1.0,-1.0,1.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,94555.0,53899.0,49483.0,17.0,108773.0,4.0,0.0,1438.0,0.0,393.0,383.0,383.0,34.06367,0.0,0.0,0.0,0.0,766000000.0,2000000.0,36250.0,61202.59,108729.0,1.0,108773.0,36250.0,61202.3,108729.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,72.0,0.0,2000000.0,0.0,383.0,393.0,383.0,32.31563,1044.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,574.5,383.0,0.0,72.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1438.0,0.0,0.0,-1.0,-1.0,3.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,132317.0,65531.0,65535.0,17.0,119995600.0,922.0,1347.0,102472.0,2466308.0,3495.0,1472.0,1472.0,1405.156,3621.0,327.0,1830.964,1036.555,inf,inf,38575130.0,66814080.0,115725400.0,7009992.0,119995600.0,38575130.0,66814080.0,115725400.0,7009992.0,119618900.0,20526330.0,41822300.0,95113340.0,97.0,1.0,0.0,0.0,0.0,30366.0,43120.0,4000000.0,2000000.0,1472.0,3621.0,1472.0,1045.965,1094043.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,10.0,2208.0,1472.0,1830.964,30366.0,0.0,0.0,0.0,0.0,0.0,0.0,922.0,102472.0,1347.0,2466308.0,65535.0,65535.0,216.0,773.0,10863880.0,8246154.0,16130150.0,10863880.0,115725400.0,61880060.0,115725400.0,115725400.0,1.0


In [15]:
# Label distribution: we expect a single attack type, since we loaded only one per-attack file
data2[' Label'].value_counts()

 Label
UDP       3754680
MSSQL       24392
BENIGN       3134
Name: count, dtype: int64

Let's take a look at the whole of Day 2. Unfortunately, we only performed a very brief analysis, back when we had hardware capable of loading it.

### Less Detailed Analysis on Full Data

In [None]:
data2 = pd.read_csv(PATH_DAY2)

In [12]:
data2.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20364525 entries, 0 to 20364524
Columns: 88 entries, Unnamed: 0 to  Label
dtypes: float64(45), int64(37), object(6)
memory usage: 13.4+ GB


In [13]:
# Any columns containing NaN values?
data2.columns[data2.isna().any(axis=0)]

Index(['Flow Bytes/s'], dtype='object')

In [14]:
# Any same-value columns?
data2.columns[data2.nunique() == 1]

Index([' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags', 'FIN Flag Count',
       ' PSH Flag Count', ' ECE Flag Count', 'Fwd Avg Bytes/Bulk',
       ' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk',
       ' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate'],
      dtype='object')

Same as in the training set: these features can safely be dropped. It is questionable, though, whether they are genuinely all 0s or whether the CICFlowMeter version used was flawed.
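Dropping these features could look like the sketch below. It reuses the `nunique() == 1` check from the cell above on a tiny toy frame; the helper name `drop_constant_columns` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` without columns holding a single distinct value.

    Hypothetical helper, named here only for illustration.
    """
    # nunique() ignores NaN by default, mirroring the check used above
    constant_cols = df.columns[df.nunique() == 1]
    return df.drop(columns=constant_cols)


# Toy frame with made-up values; column names keep the dataset's leading spaces
toy = pd.DataFrame({
    " Bwd PSH Flags": [0, 0, 0],       # constant, will be dropped
    " Flow Duration": [1, 2, 106422],  # varies, will be kept
})
reduced = drop_constant_columns(toy)
print(reduced.columns.tolist())
```

When training on Day 1 and testing on Day 2, it is probably safest to derive the list of constant columns from the training data only and drop the same columns from both sets.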

In [15]:
# Test data label distribution
data2[' Label'].value_counts()

MSSQL      5787453
Syn        4891500
UDP        3867155
NetBIOS    3657497
LDAP       1915122
Portmap     186960
BENIGN       56965
UDPLag        1873
Name:  Label, dtype: int64

In [16]:
# How many attacking IPs are there?
data2[data2[' Label'] != 'BENIGN'][' Source IP'].value_counts()

172.16.0.5      20299481
192.168.50.4        8079
Name:  Source IP, dtype: int64

Almost all attack flows originate from a single IP address (172.16.0.5). Not really a *distributed* DoS dataset after all, is it?
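To put a number on this, the counts above give 20299481 / (20299481 + 8079), i.e. roughly 99.96% of attack flows coming from 172.16.0.5. A small sketch of how such shares could be computed directly follows; the helper name `attack_source_share` and the toy rows are illustrative assumptions.

```python
import pandas as pd


def attack_source_share(df: pd.DataFrame, label_col: str = " Label",
                        ip_col: str = " Source IP") -> pd.Series:
    """Fraction of non-benign flows per source IP (hypothetical helper)."""
    attacks = df[df[label_col] != "BENIGN"]
    # normalize=True turns absolute counts into fractions summing to 1
    return attacks[ip_col].value_counts(normalize=True)


# Toy data with made-up rows, only to show the call
toy = pd.DataFrame({
    " Source IP": ["172.16.0.5", "172.16.0.5", "172.16.0.5",
                   "192.168.50.4", "192.168.50.8"],
    " Label": ["UDP", "UDP", "MSSQL", "UDP", "BENIGN"],
})
shares = attack_source_share(toy)
print(shares)
```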

In [35]:
# Do the flows account for more packets than the actual PCAP contains?
data2[' Total Fwd Packets'].sum() + data2[' Total Backward Packets'].sum()

63287147

In [38]:
!tcpdump -r '/data/kinit/gold_data/cic-ddos2019/PCAPs/03-11.pcap' 2> /dev/null | wc -l

61494280


Packet counting with `tshark`:

```shell
for FILE in *.pcap; do echo "$FILE" `tshark -r $FILE 2> /dev/null | wc -l`; done
```

- 01-12_1.pcap 87534557
- 01-12_2.pcap 50886014
- 01-12_3.pcap 88461786
- 01-12_4.pcap 23905530
- 03-11.pcap 61408736


The feature extraction used appears to be broken: the flows account for more packets than the underlying PCAP files contain.
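The size of the discrepancy for Day 2 can be worked out from the numbers reported above. The sketch below only redoes that arithmetic; the function name `packet_surplus` is an assumption for illustration. Note that the `tcpdump` and `tshark` line counts for `03-11.pcap` differ slightly from each other, so both are compared.

```python
# Values copied from the outputs above (Day 2, 03-11)
flow_packets = 63_287_147  # sum of fwd + bwd packets over all flows
pcap_tcpdump = 61_494_280  # line count reported via tcpdump
pcap_tshark = 61_408_736   # line count reported via tshark


def packet_surplus(flow_total: int, pcap_total: int) -> tuple[int, float]:
    """Absolute and relative packet surplus of flows over the PCAP."""
    extra = flow_total - pcap_total
    return extra, extra / pcap_total


for tool, total in [("tcpdump", pcap_tcpdump), ("tshark", pcap_tshark)]:
    extra, ratio = packet_surplus(flow_packets, total)
    print(f"{tool}: {extra} extra packets in flows ({ratio:.2%})")
```

Either way, the flows contain roughly 3% more packets than the capture they were supposedly extracted from.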

Continuity analysis was performed directly with the PCAP files.