# CIC-IDS2017 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
FOLDER_CSV = '/data/surv/cic-ids2017/csvs_labeled_flows'
FOLDER_PCAPS = '/data/surv/cic-ids2017/pcaps'

## CSVs Analysis

In [3]:
labeled_flows_fullpaths = [os.path.join(FOLDER_CSV, file) for file in os.listdir(FOLDER_CSV) if file.endswith('.csv')]
labeled_flows_fullpaths

['/data/surv/cic-ids2017/csvs_labeled_flows/.Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv.swp',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Friday-WorkingHours-Morning.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Tuesday-WorkingHours.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Monday-WorkingHours.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Wednesday-workingHours.pcap_ISCX.csv',
 '/data/surv/cic-ids2017/csvs_labeled_flows/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv']

In [None]:
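# Load all the files into a single DataFrame
frames = []

for flowfile in labeled_flows_fullpaths:
    print(flowfile)
    # latin1 maps every byte to a character, so non-UTF-8 bytes in the labels do not abort the load
    frames.append(pd.read_csv(flowfile, encoding='latin1'))

# Concatenate once instead of growing the DataFrame inside the loop
data = pd.concat(frames)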

In [7]:
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 3119345 entries, 0 to 286466
Data columns (total 85 columns):
 #   Column                        Non-Null Count    Dtype  
---  ------                        --------------    -----  
 0   Flow ID                       2830743 non-null  object 
 1    Source IP                    2830743 non-null  object 
 2    Source Port                  2830743 non-null  float64
 3    Destination IP               2830743 non-null  object 
 4    Destination Port             2830743 non-null  float64
 5    Protocol                     2830743 non-null  float64
 6    Timestamp                    2830743 non-null  object 
 7    Flow Duration                2830743 non-null  float64
 8    Total Fwd Packets            2830743 non-null  float64
 9    Total Backward Packets       2830743 non-null  float64
 10  Total Length of Fwd Packets   2830743 non-null  float64
 11   Total Length of Bwd Packets  2830743 non-null  float64
 12   Fwd Packet Length Max        2830

In [10]:
len(data)

3119345

In [8]:
data[' Label'].value_counts()

 Label
BENIGN                        2273097
DoS Hulk                       231073
PortScan                       158930
DDoS                           128027
DoS GoldenEye                   10293
FTP-Patator                      7938
SSH-Patator                      5897
DoS slowloris                    5796
DoS Slowhttptest                 5499
Bot                              1966
Web Attack  Brute Force         1507
Web Attack  XSS                  652
Infiltration                       36
Web Attack  Sql Injection         21
Heartbleed                         11
Name: count, dtype: int64

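Column names carry leading spaces (e.g. `' Label'`, `' Source IP'`), and the Web Attack labels contain unprintable characters. Only 2,830,743 of the 3,119,345 rows hold non-null values, so the remaining 288,602 rows lack the IP addresses, the source port, and every other field listed above. Both issues say a lot about the quality of the data.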
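The issues noted above could be handled with a cleanup step along the following lines. This is a minimal sketch, not part of the original analysis: `clean_flows` is a hypothetical helper, the demo rows are made up, and the byte `\x96` is assumed to stand in for the unprintable character seen in the Web Attack labels.

```python
import pandas as pd


def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleanup helper (an assumption, not the author's method)."""
    out = df.copy()
    # Strip leading/trailing spaces from column names (' Label' -> 'Label').
    out.columns = out.columns.str.strip()
    # Drop rows in which every field is missing.
    out = out.dropna(how="all")
    # Replace characters outside printable ASCII with a hyphen,
    # then collapse repeated whitespace in the labels.
    out["Label"] = (
        out["Label"]
        .str.replace(r"[^\x20-\x7E]", "-", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    return out.reset_index(drop=True)


# Made-up demo rows that mimic the problems described above.
demo = pd.DataFrame({
    "Flow ID": ["a", None, "b"],
    " Source IP": ["192.168.10.5", None, "192.168.10.8"],
    " Label": ["BENIGN", None, "Web Attack \x96 Brute Force"],
})
cleaned = clean_flows(demo)
print(list(cleaned.columns))
print(cleaned["Label"].tolist())
```

Applied to the full dataset, the call would be `data = clean_flows(data)`. Whether a hyphen is the right replacement for the unprintable character is a judgment call.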

In [9]:
data.describe()


Unnamed: 0,Source Port,Destination Port,Protocol,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min
count,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2829385.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0
mean,41128.86,8071.483,9.880341,14785660.0,9.36116,10.39377,549.3024,16162.64,207.5999,18.71366,58.20194,68.91013,870.8495,41.04958,305.9493,335.3257,inf,inf,1298449.0,2919271.0,9182475.0,162379.6,14482960.0,2610193.0,3266957.0,9042939.0,1021893.0,9893830.0,1805784.0,1485973.0,4684692.0,967261.4,0.04644646,0.0,0.0001112782,0.0,-25997.39,-2273.275,63865.35,6995.192,16.4345,950.4024,171.9444,294.9756,486154.8,0.03537976,0.04644646,0.0002423392,0.2980705,0.3158443,0.09482316,0.0001112782,0.000243399,0.6835004,191.9837,58.20194,305.9493,-25997.39,0.0,0.0,0.0,0.0,0.0,0.0,9.36116,549.2919,10.39377,16162.3,6989.837,1989.433,5.418218,-2741.688,81551.32,41134.12,153182.5,58295.82,8316037.0,503843.9,8695752.0,7920031.0
std,22294.94,18283.63,5.261922,33653740.0,749.6728,997.3883,9993.589,2263088.0,717.1848,60.33935,186.0912,281.1871,1946.367,68.8626,605.2568,839.6932,,,4507944.0,8045870.0,24459540.0,2950282.0,33575810.0,9525722.0,9639055.0,24529160.0,8591436.0,28736610.0,8887197.0,6278469.0,17160950.0,8308983.0,0.21045,0.0,0.01054826,0.0,21052860.0,1452209.0,247537.1,38151.7,25.23772,2028.229,305.4915,631.8001,1647490.0,0.1847378,0.21045,0.01556536,0.4574107,0.4648513,0.2929706,0.01054826,0.01559935,0.680492,331.8603,186.0912,605.2568,21052860.0,0.0,0.0,0.0,0.0,0.0,0.0,749.6728,9980.07,997.3883,2263057.0,14338.73,8456.883,636.4257,1084989.0,648599.9,393381.5,1025825.0,577092.3,23630080.0,4602984.0,24366890.0,23363420.0
min,0.0,0.0,0.0,-13.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-261000000.0,-2000000.0,-13.0,0.0,-13.0,-14.0,0.0,0.0,0.0,0.0,-12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-32212230000.0,-1073741000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-32212230000.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-1.0,-1.0,0.0,-536870700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,32774.0,53.0,6.0,155.0,2.0,1.0,12.0,0.0,6.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,119.3197,3.446226,63.66667,0.0,123.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,20.0,1.749446,0.1229197,0.0,6.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.5,6.0,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0,1.0,0.0,-1.0,-1.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,50944.0,80.0,6.0,31316.0,2.0,2.0,62.0,123.0,37.0,2.0,34.0,0.0,79.0,0.0,72.0,0.0,4595.549,110.6684,11438.84,137.1787,30865.0,4.0,43.0,26.0,0.0,43.0,3.0,3.0,3.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,64.0,40.0,61.32524,19.82789,2.0,87.0,57.2,25.98076,675.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,72.25,34.0,72.0,64.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,62.0,2.0,123.0,251.0,-1.0,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,58413.0,443.0,17.0,3204828.0,5.0,4.0,187.0,482.0,81.0,36.0,50.0,26.16295,280.0,77.0,181.0,77.94054,166666.7,23255.81,337426.6,691266.3,2440145.0,64.0,1242844.0,206306.4,65989.82,931006.0,48.0,98580.5,18248.57,15724.09,60210.0,45.0,0.0,0.0,0.0,0.0,120.0,104.0,12048.19,7352.941,36.0,525.0,119.8,174.3239,30388.84,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,149.2639,50.0,181.0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,187.0,4.0,482.0,8192.0,235.0,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,65535.0,65535.0,17.0,120000000.0,219759.0,291922.0,12900000.0,655453000.0,24820.0,2325.0,5940.857,7125.597,19530.0,2896.0,5800.5,8194.66,inf,inf,120000000.0,84800260.0,120000000.0,120000000.0,120000000.0,120000000.0,84602930.0,120000000.0,120000000.0,120000000.0,120000000.0,84418010.0,120000000.0,120000000.0,1.0,0.0,1.0,0.0,4644908.0,5838440.0,3000000.0,2000000.0,1448.0,24820.0,3337.143,4731.522,22400000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,156.0,3893.333,5940.857,5800.5,4644908.0,0.0,0.0,0.0,0.0,0.0,0.0,219759.0,12870340.0,291922.0,655453000.0,65535.0,65535.0,213557.0,138.0,110000000.0,74200000.0,110000000.0,110000000.0,120000000.0,76900000.0,120000000.0,120000000.0
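
The summary above reports `inf` means for `Flow Bytes/s` and `Flow Packets/s` and negative minima for several features, such as `Flow Duration` and `Fwd Header Length`. A per-column count of such values may help quantify the problem. The sketch below is an illustration under assumptions: `suspicious_value_counts` is a hypothetical helper and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def suspicious_value_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count infinite, negative, and missing values per numeric column.

    Hypothetical helper for illustration only.
    """
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_inf": numeric.isin([np.inf, -np.inf]).sum(),
        "n_negative": (numeric < 0).sum(),
        "n_nan": numeric.isna().sum(),
    })


# Made-up demo values; the column names follow the summary above.
demo_stats = pd.DataFrame({
    "Flow Duration": [-13.0, 155.0, 31316.0],
    "Flow Bytes/s": [np.inf, 119.3197, np.nan],
    "Label": ["BENIGN", "BENIGN", "DDoS"],
})
report = suspicious_value_counts(demo_stats)
print(report)
```

On the real data, `suspicious_value_counts(data)` would list every numeric column. Whether a negative value is an error depends on the feature, so the counts are only a starting point.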


## PCAP Analysis

In [5]:
pcap_fullpaths = [os.path.join(FOLDER_PCAPS, file) for file in os.listdir(FOLDER_PCAPS)]
pcap_fullpaths

['/data/surv/cic-ids2017/pcaps/Wednesday-workingHours.pcap',
 '/data/surv/cic-ids2017/pcaps/Monday-WorkingHours.pcap',
 '/data/surv/cic-ids2017/pcaps/Thursday-WorkingHours.pcap',
 '/data/surv/cic-ids2017/pcaps/Tuesday-WorkingHours.pcap',
 '/data/surv/cic-ids2017/pcaps/Friday-WorkingHours.pcap']

In [6]:
# Get general capinfos data for each file
for pcap_filepath in pcap_fullpaths:
    !capinfos $pcap_filepath
    !echo ""

File name:           /data/surv/cic-ids2017/pcaps/Wednesday-workingHours.pcap
File type:           Wireshark/... - pcapng
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: (not set)
Number of packets:   13 M
File size:           13 GB
Data size:           12 GB
Capture duration:    30457.696353 seconds
First packet time:   2017-07-05 13:42:42.084372
Last packet time:    2017-07-05 22:10:19.780725
Data byte rate:      425 kBps
Data bit rate:       3,404 kbps
Average packet size: 939.95 bytes
Average packet rate: 452 packets/s
SHA256:              cd2674db7559a53f24bc03be3239b315700174ccaef72d10f5edc4c1a08f6186
RIPEMD160:           aff58953d580dfc856f9fafa3d62993310e93dce
SHA1:                bc36902fc69f4f79e8d6ae1f2ff8751ea93d495b
Strict time order:   False
Capture oper-sys:    Linux 4.8.0-22-generic
Capture application: mergecap
Number of interfaces in file: 1
Interface #0 info:
                     Encapsulation = Ethernet (1 - e

In [7]:
# Count the total number of packets
!/data/AAAA_SCRIPTS/countpkts.sh $FOLDER_PCAPS

Processing /data/surv/cic-ids2017/pcaps/Friday-WorkingHours.pcap
Processing /data/surv/cic-ids2017/pcaps/Monday-WorkingHours.pcap
Processing /data/surv/cic-ids2017/pcaps/Thursday-WorkingHours.pcap
Processing /data/surv/cic-ids2017/pcaps/Tuesday-WorkingHours.pcap
Processing /data/surv/cic-ids2017/pcaps/Wednesday-workingHours.pcap

TOTAL FILES : 5
TOTAL PACKETS: 56370702


The original CIC-IDS2017 dataset contains no timestamp information, so data continuity had to be verified via PCAP analysis.
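
The notebook does not show how continuity was checked. One possible approach is to look for unusually long pauses between consecutive packet timestamps. The sketch below assumes the timestamps have already been extracted as epoch seconds (for example with `tshark -r <file> -T fields -e frame.time_epoch`). The function name `find_capture_gaps` and the 60-second threshold are hypothetical choices, and the demo values are made up.

```python
from typing import Iterable, List, Tuple


def find_capture_gaps(
    timestamps: Iterable[float], max_gap_s: float = 60.0
) -> List[Tuple[float, float, float]]:
    """Return (previous, current, delta) for every gap longer than max_gap_s.

    Hypothetical helper; the threshold is an assumption, not a value from the source.
    """
    # capinfos reports "Strict time order: False", so sort before comparing.
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > max_gap_s:
            gaps.append((prev, curr, delta))
    return gaps


# Made-up epoch offsets, deliberately out of order.
demo_ts = [0.0, 0.5, 1.2, 400.0, 400.1, 399.9]
gaps = find_capture_gaps(demo_ts, max_gap_s=60.0)
print(gaps)
```

An empty result would suggest a continuous capture at the chosen threshold. Legitimately quiet periods, such as night hours, can still show up as gaps.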