## CSE-CIC-IDS2018

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [2]:
# Data path
PATH = '/data/datasets_cic_impr/cse_cicids2018'

files = os.listdir(PATH)
files

['2018-02-16-Friday.csv',
 '2018-02-14-Wednesday.csv',
 '2018-02-21-Wednesday.csv',
 '2018-02-28-Wednesday.csv',
 '2018-03-02-Friday.csv',
 '2018-02-23-Friday.csv',
 '2018-02-15-Thursday.csv',
 '2018-02-20-Tuesday.csv',
 '2018-03-01-Thursday.csv',
 '2018-02-22-Thursday.csv']

In [5]:
for file in files:
    filepath = os.path.join(PATH, file)

    !wc -l $filepath

7390267 /data/datasets_cic_impr/cse_cicids2018/2018-02-16-Friday.csv
5898351 /data/datasets_cic_impr/cse_cicids2018/2018-02-14-Wednesday.csv
6962594 /data/datasets_cic_impr/cse_cicids2018/2018-02-21-Wednesday.csv
6568727 /data/datasets_cic_impr/cse_cicids2018/2018-02-28-Wednesday.csv
6311372 /data/datasets_cic_impr/cse_cicids2018/2018-03-02-Friday.csv
5976482 /data/datasets_cic_impr/cse_cicids2018/2018-02-23-Friday.csv
5410103 /data/datasets_cic_impr/cse_cicids2018/2018-02-15-Thursday.csv
6054703 /data/datasets_cic_impr/cse_cicids2018/2018-02-20-Tuesday.csv
6551402 /data/datasets_cic_impr/cse_cicids2018/2018-03-01-Thursday.csv
6071154 /data/datasets_cic_impr/cse_cicids2018/2018-02-22-Thursday.csv


In [None]:
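# Load all files into one DataFrame (memory-intensive: about 63 million rows)
frames = []

for file in files:
    filepath = os.path.join(PATH, file)

    print(file)
    frames.append(pd.read_csv(filepath))

# Concatenate once instead of growing the DataFrame inside the loop
data = pd.concat(frames)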
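
The class analysis in this notebook only touches the `Label` and `Attempted Category` columns, so a lighter-weight load may be enough when memory is tight. The following is a minimal sketch, not part of the original notebook: the helper name `load_label_columns` is hypothetical, and it assumes every CSV file in the directory contains both columns.

```python
import os

import pandas as pd


def load_label_columns(path, columns=("Label", "Attempted Category")):
    """Read only the given columns from every CSV file in `path`.

    Hypothetical helper for illustration; assumes each file has all
    requested columns in its header.
    """
    frames = []
    # ISO-dated file names sort chronologically, unlike raw os.listdir() order
    for name in sorted(os.listdir(path)):
        if not name.endswith(".csv"):
            continue
        filepath = os.path.join(path, name)
        # usecols skips all other columns while parsing
        frames.append(pd.read_csv(filepath, usecols=list(columns)))
    # One concat at the end; ignore_index gives a unique 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_label_columns(PATH)` would then return a two-column DataFrame suitable for the `value_counts()` calls used in this notebook.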

In [5]:
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63195145 entries, 0 to 6071152
Data columns (total 91 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   id                          int64  
 1   Flow ID                     object 
 2   Src IP                      object 
 3   Src Port                    int64  
 4   Dst IP                      object 
 5   Dst Port                    int64  
 6   Protocol                    int64  
 7   Timestamp                   object 
 8   Flow Duration               int64  
 9   Total Fwd Packet            int64  
 10  Total Bwd packets           int64  
 11  Total Length of Fwd Packet  int64  
 12  Total Length of Bwd Packet  int64  
 13  Fwd Packet Length Max       int64  
 14  Fwd Packet Length Min       int64  
 15  Fwd Packet Length Mean      float64
 16  Fwd Packet Length Std       float64
 17  Bwd Packet Length Max       int64  
 18  Bwd Packet Length Min       int64  
 19  Bwd Packet Length Me

In [6]:
data['Label'].value_counts()

BENIGN                                          59353486
DoS Hulk                                         1803160
DDoS-HOIC                                        1082293
FTP-BruteForce - Attempted                        298874
DDoS-LOIC-HTTP                                    289328
Botnet Ares                                       142921
SSH-BruteForce                                     94197
Infiltration - NMAP Portscan                       89374
DoS GoldenEye                                      22560
DoS Slowloris                                       8490
DoS GoldenEye - Attempted                           4301
DDoS-LOIC-UDP                                       2527
DoS Slowloris - Attempted                           2280
Botnet Ares - Attempted                              262
DDoS-LOIC-UDP - Attempted                            251
Infiltration - Communication Victim Attacker         204
Web Attack - Brute Force - Attempted                 137
Web Attack - Brute Force       

In [7]:
len(data)

63195145
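
This row count can be cross-checked against the `wc -l` output above. Assuming each file contributes exactly one header line and no record spans multiple lines, the expected number of rows is the total line count minus the number of files. A quick sketch with the counts copied from that output:

```python
# Line counts per file, copied from the `wc -l` output
line_counts = {
    "2018-02-16-Friday.csv": 7390267,
    "2018-02-14-Wednesday.csv": 5898351,
    "2018-02-21-Wednesday.csv": 6962594,
    "2018-02-28-Wednesday.csv": 6568727,
    "2018-03-02-Friday.csv": 6311372,
    "2018-02-23-Friday.csv": 5976482,
    "2018-02-15-Thursday.csv": 5410103,
    "2018-02-20-Tuesday.csv": 6054703,
    "2018-03-01-Thursday.csv": 6551402,
    "2018-02-22-Thursday.csv": 6071154,
}

# Subtract one header line per file (assumption stated above)
expected_rows = sum(line_counts.values()) - len(line_counts)
print(expected_rows)  # 63195145
```

The result matches `len(data)`, so all ten files were loaded in full.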

In [9]:
data[data['Attempted Category'] == -1]['Label'].value_counts()

BENIGN                                          59353486
DoS Hulk                                         1803160
DDoS-HOIC                                        1082293
DDoS-LOIC-HTTP                                    289328
Botnet Ares                                       142921
SSH-BruteForce                                     94197
Infiltration - NMAP Portscan                       89374
DoS GoldenEye                                      22560
DoS Slowloris                                       8490
DDoS-LOIC-UDP                                       2527
Infiltration - Communication Victim Attacker         204
Web Attack - Brute Force                             131
Web Attack - XSS                                     113
Infiltration - Dropbox Download                       85
Web Attack - SQL                                      39
Name: Label, dtype: int64
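
Judging from the two outputs above, filtering on `Attempted Category == -1` drops exactly the labels carrying the `- Attempted` suffix and leaves the other counts unchanged; `FTP-BruteForce`, for instance, only appears in its attempted form. To see both variants of each attack class side by side, one option is to strip the suffix and cross-tabulate. This is a sketch under that assumption; the function name `attempted_by_class` and the toy values (including the non-negative category codes) are illustrative and not taken from the dataset.

```python
import pandas as pd

SUFFIX = " - Attempted"


def attempted_by_class(df, label_col="Label", attempted_col="Attempted Category"):
    """Count non-attempted vs. attempted flows per base attack class."""
    # Drop the suffix so both variants share one row
    base = df[label_col].str.replace(SUFFIX, "", regex=False)
    # -1 is treated as "not attempted" (assumption inferred from the outputs)
    status = (df[attempted_col] == -1).map(
        {True: "not_attempted", False: "attempted"}
    )
    table = pd.crosstab(base, status)
    # Guarantee both columns exist even if one status never occurs
    return table.reindex(columns=["not_attempted", "attempted"], fill_value=0)


# Toy data for illustration only; category codes 4 and 0 are placeholders
toy = pd.DataFrame({
    "Label": [
        "BENIGN",
        "DoS GoldenEye",
        "DoS GoldenEye",
        "DoS GoldenEye - Attempted",
        "FTP-BruteForce - Attempted",
    ],
    "Attempted Category": [-1, -1, -1, 4, 0],
})
toy_table = attempted_by_class(toy)
print(toy_table)
```

On the full dataset, `attempted_by_class(data)` would give one row per attack class, which makes classes consisting only of attempted flows easy to spot.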

Unfortunately, only this brief class analysis was initially performed. Nevertheless, our subsequent experience with Improved CIC-IDS2018 indicates much better overall data quality than in the original CIC-IDS2018, which contains data inconsistencies, as shown in the corresponding notebook. Time continuity analysis was not performed because it would be the same as for the original CIC-IDS2018 data.